We discarded roughly three weeks of agent jobs without a single error in the logs. No exception, no failed checkpoint, no alert. The state graph ran clean, returned a valid response to the user on Telegram, and silently dropped half the accumulated context every single invocation. I found it because a user asked our agent to "continue where we left off" and it had no idea what "left off" meant — even though the checkpoint table had rows. The rows were there. The data inside them was wrong. That is the worst kind of bug: the system that tells you everything is fine.
This is what it actually took to get production checkpointing for stateful LangGraph agents to a place I trust on Oracle Cloud, running real WhatsApp and Telegram traffic and routing between Groq and Claude. Three rewrites. The third one finally held.
Rewrite one: the state schema mismatch that swallowed everything
The first version looked correct. We used a TypedDict state with a messages field and a few accumulator fields for retrieved context and tool outputs. The graph compiled. Nodes fired in order. The agent answered questions.
The problem was the reducer. In LangGraph, your state schema isn't just a type hint: each field can have a reducer function that decides how new values merge into existing state. If you don't specify one, the default is to replace the field wholesale whenever a node returns a value for it. We had written this:
from typing import TypedDict
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
    messages: list[BaseMessage]
    retrieved_context: list[str]
    tool_calls: list[dict]
Plain list. No Annotated, no reducer. So when a node returned {"retrieved_context": [new_chunk]}, LangGraph didn't append — it overwrote the entire list with a single-element list. Every retrieval step erased the previous one. By the time we reached the synthesis node, retrieved_context held exactly one chunk: the last one. The agent answered using a fraction of what it had gathered.
Why no error? Because a list of one is a perfectly valid list. The type checked. The graph ran. The checkpoint saved the one-element list faithfully. Everything downstream of the bug behaved correctly given the wrong input. We only noticed because the quality of answers was mediocre in a way we couldn't explain — and "mediocre answers" doesn't page anyone at 2am.
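The two behaviors are easy to reproduce without LangGraph at all. The sketch below is a simplified stand-in for the merge step, not LangGraph's actual implementation; `apply_update` and its `reducers` argument are hypothetical names used only for illustration.

```python
from operator import add

def apply_update(state, update, reducers=None):
    """Merge a node's return value into state.

    Fields with a reducer become reducer(old, new); fields without one
    are replaced, mirroring the default behavior described above.
    """
    reducers = reducers or {}
    merged = dict(state)
    for key, value in update.items():
        if key in reducers and key in merged:
            merged[key] = reducers[key](merged[key], value)
        else:
            merged[key] = value
    return merged

state = {"retrieved_context": []}
updates = [{"retrieved_context": ["chunk A"]}, {"retrieved_context": ["chunk B"]}]

replaced = state
accumulated = state
for update in updates:
    replaced = apply_update(replaced, update)  # no reducer: overwrite
    accumulated = apply_update(accumulated, update, {"retrieved_context": add})

print(replaced["retrieved_context"])     # ['chunk B']
print(accumulated["retrieved_context"])  # ['chunk A', 'chunk B']
```

Both results are valid lists of strings, which is why nothing downstream complained.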
The fix is one line of annotation per field that should accumulate:
from operator import add
from typing import Annotated, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]  # append, or replace on matching ID
    retrieved_context: Annotated[list[str], add]  # old + new
    tool_calls: Annotated[list[dict], add]  # old + new
add_messages is the built-in reducer that handles message dedup and ID matching. For plain accumulation, operator.add works. The lesson that cost us three weeks: in LangGraph, the absence of a reducer is itself a decision, and it's almost never the one you want for list state. Audit every field. If you can't say out loud whether it should replace or accumulate, you have a bug waiting.
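One way to make that audit mechanical is to scan the schema for list fields with no `Annotated` metadata. This is a minimal sketch using only the standard library; `list_fields_without_reducer` and `ExampleState` are hypothetical names, and the check assumes any `Annotated` metadata on a field is a reducer.

```python
from operator import add
from typing import Annotated, TypedDict, get_origin, get_type_hints

def list_fields_without_reducer(schema):
    """Return names of list-typed fields that carry no Annotated metadata."""
    missing = []
    for name, hint in get_type_hints(schema, include_extras=True).items():
        if get_origin(hint) is Annotated:
            continue  # has metadata, assumed here to be a reducer
        if hint is list or get_origin(hint) is list:
            missing.append(name)
    return missing

class ExampleState(TypedDict):
    retrieved_context: list[str]            # no reducer: replaced on every write
    tool_calls: Annotated[list[dict], add]  # accumulates
    summary: str                            # not a list, ignored by the audit

print(list_fields_without_reducer(ExampleState))  # ['retrieved_context']
```

A flagged field is not necessarily a bug, since replacement may be what you want. The point is that every flagged field should be a decision someone made on purpose.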
Rewrite two: checkpoint corruption under concurrency
Once state merged correctly, we shipped to a small WhatsApp cohort. The second failure mode showed up under load: two messages from the same user arriving within a second of each other — common on mobile, where people send a thought, then a correction.
We were using the Postgres checkpointer against our Postgres database hosted on Oracle Cloud, keyed by thread_id (the user's conversation). Both messages spawned graph runs against the same thread. Both read the checkpoint at version N. Both computed their updates. Both wrote back. The second write clobbered the first, and the checkpoint metadata ended up referencing a parent version that the writes had skipped over. Result: a thread whose checkpoint chain had a hole in it. When LangGraph tried to resume, it walked the parent pointers and hit a version that no longer reconciled. The error we eventually surfaced:
ValueError: Checkpoint parent_config references checkpoint_id
'1ef...' not found for thread '<wa_id>'
That one does page you, which I was almost grateful for after the silent disaster of rewrite one.
The naive instinct is to slap a lock around the whole graph run. Don't. A graph run that calls Claude can take eight to twelve seconds; holding a row lock that long under WhatsApp burst traffic just moves the failure to connection-pool exhaustion. We saw exactly that — Oracle's connection limits on our tier are not generous, and a pile of runs each holding a lock and waiting on a model API drained the pool inside a minute.
What worked was a two-part change. First, serialize per-thread at the application layer, not the database layer: a lightweight per-thread_id async lock so that two messages for the same conversation queue instead of racing, while different conversations stay fully parallel. Second, debounce inbound messages. If a user sends three fragments in two seconds, we coalesce them into a single graph invocation. This isn't a workaround — it's correct product behavior. A human reads three quick messages as one thought. So should the agent.
import asyncio
from collections import defaultdict

# per-thread serialization, not global
thread_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

async def run_for_thread(thread_id: str, payload):
    # runs for the same thread_id queue here; other threads proceed in parallel
    async with thread_locks[thread_id]:
        config = {"configurable": {"thread_id": thread_id}}
        return await graph.ainvoke(payload, config)
For multi-process deployment this lock has to be distributed — we moved to a Redis-backed lock with a short TTL once we ran more than one worker. The TTL matters: set it longer than your slowest model call plus checkpoint write, or you'll release mid-run and reintroduce the race. We use 30 seconds, which covers our Claude tail latency with margin.
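To see why the TTL has to outlast the run, here is a small in-memory stand-in for a lease-style lock. It is not a Redis client and not our production code. `LeaseLock` is a hypothetical class, and time is passed in explicitly so the behavior is deterministic.

```python
class LeaseLock:
    """In-memory stand-in for a TTL lock. Illustrative only."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._held = {}  # key -> (owner token, expiry time)

    def acquire(self, key, token, now):
        held = self._held.get(key)
        if held is not None and held[1] > now:
            return False  # another owner holds a live lease
        self._held[key] = (token, now + self.ttl_s)
        return True

    def release(self, key, token):
        held = self._held.get(key)
        if held is not None and held[0] == token:
            del self._held[key]  # only the current owner may release
            return True
        return False

def second_run_can_start(ttl_s, seconds_into_first_run):
    """True if a second run gets the lock while the first is still in flight."""
    lock = LeaseLock(ttl_s)
    lock.acquire("thread-1", "run-a", now=0)
    return lock.acquire("thread-1", "run-b", now=seconds_into_first_run)

print(second_run_can_start(ttl_s=5, seconds_into_first_run=11))   # True: lease expired mid-run
print(second_run_can_start(ttl_s=30, seconds_into_first_run=11))  # False: lease still held
```

The owner token matters as much as the TTL. Without it, a run whose lease has already expired could release a lock that now belongs to a different run.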
Rewrite three: the pattern that finally held
The third rewrite wasn't about a new bug. It was about admitting the architecture was wrong. We had one giant graph that did routing, retrieval, tool calls, and synthesis, all sharing one fat state object. Every change to one node risked the checkpoint shape, and a checkpoint shape change is a migration problem when you have live threads persisted in Postgres.
The pattern that fixed it: a thin, stable supervisor graph with a deliberately minimal checkpointed state, and stateless subgraph workers that do not persist.
The supervisor holds only what must survive a crash and restart: the message history, a compact running summary, and a routing decision field. That's it. The heavy intermediate state — raw retrieved chunks, partial tool outputs, the scratch space a synthesis step needs — lives inside subgraph executions that run to completion within a single supervisor step and return only their distilled result. Those subgraphs are not checkpointed. If a subgraph crashes mid-run, the supervisor's last good checkpoint is intact and we retry the step cleanly, because the step is idempotent against the supervisor state.
This separation does three things that matter in production:
- The checkpointed schema barely changes. I can rewrite a retrieval subgraph completely without touching the persisted state shape, so I don't need a checkpoint migration for live threads. The thing that's hardest to evolve is now the thing that changes least.
- Checkpoints stay small. Our supervisor checkpoint rows are a few KB. The earlier fat-state version wrote checkpoints that occasionally crossed 100KB because raw retrieved chunks were sitting in persisted state. That's a real cost on storage and on every read-modify-write cycle.
- Model routing lives in one obvious place. The supervisor decides Groq versus Claude. Cheap, latency-sensitive classification and routing — "is this a question, a correction, or chitchat" — goes to Groq, where the response comes back fast enough that the user doesn't feel it. Anything requiring real synthesis goes to Claude. Keeping that decision in the stateful supervisor rather than scattered across subgraphs means I can change the routing policy in one function and reason about its cost.
Here's the shape:
# supervisor: stateful, checkpointed, minimal
class SupervisorState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    summary: str  # replaced each turn, intentionally
    route: str  # replaced each turn

def route_node(state):
    decision = classify_with_groq(state["messages"][-1])
    return {"route": decision}

def work_node(state):
    # subgraph runs to completion, returns only distilled output
    result = retrieval_subgraph.invoke({"query": last_user(state)})
    answer = synthesize_with_claude(state, result["summary"])
    return {"messages": [AIMessage(answer)],
            "summary": update_summary(state["summary"], answer)}
Note summary and route use the default replace behavior on purpose — they should overwrite each turn. That's the discipline from rewrite one applied deliberately: replace where replace is correct, accumulate where accumulation is correct, and never leave it to accident.
What I'd tell anyone shipping LangGraph stateful agents to production
Get the checkpointer right before anything else. The in-memory MemorySaver is fine for a notebook and a trap for production — your state evaporates on restart and you'll build a demo that can't survive a deploy. Go straight to the Postgres (or SQLite, for single-node) checkpointer and treat your database as part of the agent, not an afterthought. Run setup() on the checkpointer in a migration step, not lazily on first request; the lazy path races under concurrent cold starts and you'll see duplicate-table or missing-table errors that look like infrastructure flakiness but aren't.
Test resumption explicitly. Most agent test suites run a graph start to finish in one process and call it green. That tells you nothing about checkpointing. Write a test that runs half the graph, throws away the in-memory objects entirely, reconstructs the graph from cold, resumes from the persisted checkpoint with the same thread_id, and asserts the final state is correct. If that test passes, your checkpointing works. If you don't have that test, you don't know whether it does.
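The shape of that test, sketched against a toy runner rather than LangGraph itself: `ToyRunner`, `STEPS`, and the SQLite table below are hypothetical stand-ins that only illustrate the run-half, discard, rebuild, resume sequence. A real version would do the same thing with your compiled graph and your production checkpointer.

```python
import json
import os
import sqlite3
import tempfile

STEPS = [
    ("retrieve", lambda s: {**s, "chunks": ["a", "b"]}),
    ("synthesize", lambda s: {**s, "answer": f"used {len(s['chunks'])} chunks"}),
]

class ToyRunner:
    """Toy stand-in for a checkpointed graph; persists state after each step."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(thread_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
        )

    def load(self, thread_id):
        row = self.conn.execute(
            "SELECT next_step, state FROM checkpoints WHERE thread_id = ?",
            (thread_id,),
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else (0, {})

    def run(self, thread_id, initial=None, stop_after=None):
        next_step, state = self.load(thread_id)
        if next_step == 0 and initial is not None:
            state = dict(initial)
        while next_step < len(STEPS) and (stop_after is None or next_step < stop_after):
            state = STEPS[next_step][1](state)
            next_step += 1
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (thread_id, next_step, json.dumps(state)),
            )
            self.conn.commit()
        return state

def test_resume_from_cold():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "checkpoints.db")

        first = ToyRunner(db_path)
        first.run("thread-1", {"query": "status?"}, stop_after=1)  # run half
        first.conn.close()
        del first  # throw away every in-memory object

        second = ToyRunner(db_path)  # rebuild from cold
        final = second.run("thread-1")  # resume with the same thread_id
        second.conn.close()

    assert final["chunks"] == ["a", "b"]  # state from before the restart survived
    assert final["answer"] == "used 2 chunks"
    return final

final_state = test_resume_from_cold()
```

The assertion on `chunks` is the important one. It only passes if the state written before the restart was actually read back.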
Watch checkpoint size as a first-class metric. We alert when a thread's checkpoint row crosses 50KB, because growth there is the early signal that some intermediate state is leaking into persistence. It's the canary for the rewrite-one and rewrite-three mistakes both.
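A rough sketch of that check follows. It assumes JSON length is a good enough proxy for persisted size, which is an approximation: the real checkpointer uses its own serializer, so measure the stored row if you need exact numbers. The function names and sample states are hypothetical.

```python
import json

ALERT_THRESHOLD_BYTES = 50 * 1024  # the 50KB alert line described above

def checkpoint_size_bytes(state):
    """Approximate persisted size as the UTF-8 length of the JSON encoding."""
    return len(json.dumps(state, ensure_ascii=False).encode("utf-8"))

def should_alert(state, threshold=ALERT_THRESHOLD_BYTES):
    return checkpoint_size_bytes(state) > threshold

thin = {"messages": ["hi", "hello"], "summary": "greeting", "route": "chitchat"}
# 60 raw chunks of 1KB each leaking into persisted state
fat = {**thin, "retrieved_context": ["x" * 1024] * 60}

print(should_alert(thin))  # False
print(should_alert(fat))   # True
```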
And keep the persisted state as small as you can defend. Every field in your checkpointed schema is a thing you're promising to migrate later. The fat-state design felt convenient for about two weeks and then became the reason every change was scary. The thin supervisor is less elegant on a whiteboard and far cheaper to live with.
Frequently Asked Questions
Q: Postgres checkpointer vs. building your own state persistence — is the built-in worth the lock-in?
A: Use the built-in. We considered rolling our own and the only real advantage was schema control, which the thin-supervisor pattern solves anyway. The built-in checkpointer gives you the parent-pointer chain and resume logic for free, and reimplementing that correctly under concurrency is exactly where we'd have introduced the same corruption bugs by hand. The lock-in is your state schema, not the checkpointer library — keep the schema small and you can migrate off either way.
Q: How do you actually handle a checkpoint schema migration on live threads?
A: You mostly avoid it by keeping the checkpointed state minimal, which is the whole point of the supervisor pattern. When you can't avoid it, version the state explicitly with a schema_version field, and write a reducer-aware migration that reads old checkpoints and rewrites them lazily on next access rather than in one big batch. We've done a forced migration exactly once and the lazy approach meant zero downtime — threads that were never touched again were never migrated, and nobody noticed.
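A minimal sketch of the lazy approach, assuming states are plain dicts and that a missing schema_version means version 1. The two migration steps and their field names are invented for illustration and are not the changes we actually shipped.

```python
CURRENT_VERSION = 3

def _v1_to_v2(state):
    # hypothetical change: a single "context" string became a running "summary"
    state = dict(state)
    state["summary"] = state.pop("context", "")
    return state

def _v2_to_v3(state):
    # hypothetical change: a "route" field was added with a default
    return {**state, "route": state.get("route", "unknown")}

MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}

def migrate_on_read(state):
    """Upgrade a loaded state to CURRENT_VERSION one step at a time.

    Returns (state, changed) so the caller writes back only when needed.
    """
    version = state.get("schema_version", 1)
    changed = False
    while version < CURRENT_VERSION:
        state = MIGRATIONS[version](state)
        version += 1
        state["schema_version"] = version
        changed = True
    return state, changed

old = {"schema_version": 1, "messages": ["hi"], "context": "user asked about billing"}
new, changed = migrate_on_read(old)
print(new["schema_version"], changed)  # 3 True
```

Calling `migrate_on_read` when a thread is next loaded means untouched threads are never rewritten, and a thread already at the current version passes through unchanged.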
Q: Groq for routing and Claude for synthesis — what's the real latency and cost split?
A: Routing classification on Groq returns in roughly 200-400ms and costs a fraction of a cent per call, so the user never feels the routing hop. Synthesis on Claude is where the time and money go — multi-second responses and the bulk of our model spend. The split matters because routing happens on every single message while synthesis only fires when there's actual work, so putting the high-frequency cheap decision on the fast model keeps both latency and bill down.
Q: Why debounce inbound messages instead of just fixing the locking?
A: We did both, but debouncing is the one that improved answer quality, not just stability. Three message fragments processed as three separate graph runs produce three separate, context-blind responses; coalesced into one invocation they produce a single coherent answer. The locking prevents corruption; the debounce makes the agent behave like it's actually listening. On WhatsApp specifically, where fragmented messages are the norm, this was a product fix disguised as an infra fix.
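For reference, the coalescing logic can be sketched with plain asyncio. This is a simplified single-process version, not our production code: `Debouncer` is a hypothetical class, the quiet window is an assumed tuning value, and joining fragments with newlines is just one possible choice.

```python
import asyncio

class Debouncer:
    """Coalesce message fragments per thread into one handler call."""

    def __init__(self, handler, window_s=2.0):
        self.handler = handler  # async callable: (thread_id, text) -> None
        self.window_s = window_s
        self._pending = {}  # thread_id -> list of fragments
        self._timers = {}   # thread_id -> asyncio.Task

    def submit(self, thread_id, text):
        self._pending.setdefault(thread_id, []).append(text)
        timer = self._timers.get(thread_id)
        if timer is not None:
            timer.cancel()  # a new fragment restarts the quiet window
        self._timers[thread_id] = asyncio.create_task(self._flush_later(thread_id))

    async def _flush_later(self, thread_id):
        await asyncio.sleep(self.window_s)
        # From here to the handler call there is no await, so a late
        # fragment cannot cancel a flush that has already claimed its batch.
        fragments = self._pending.pop(thread_id, [])
        self._timers.pop(thread_id, None)
        if fragments:
            await self.handler(thread_id, "\n".join(fragments))

async def demo():
    calls = []

    async def handler(thread_id, text):
        calls.append((thread_id, text))

    debouncer = Debouncer(handler, window_s=0.1)
    for fragment in ["can you check", "the order", "from yesterday"]:
        debouncer.submit("user-1", fragment)
        await asyncio.sleep(0.01)
    await asyncio.sleep(0.3)  # let the quiet window elapse
    return calls

calls = asyncio.run(demo())
print(calls)  # [('user-1', 'can you check\nthe order\nfrom yesterday')]
```

In practice the handler would be the per-thread locked graph invocation, so the debounce decides what goes into a run and the lock decides when it runs.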
Q: Should subgraphs ever be checkpointed?
A: Only if a subgraph is genuinely long-running and needs to survive a crash mid-execution — for example, a multi-minute batch job. For interactive agents where a subgraph completes inside one supervisor step in seconds, checkpointing it adds storage cost, schema fragility, and resume complexity for no benefit. Default to stateless subgraphs and make them idempotent so a retry from the supervisor's checkpoint is always safe.