For three weeks, our intake agent silently discarded every job it processed. No error. No exception. The Telegram bot replied "Got it, working on that" and then nothing landed in the queue. Logs were clean. The graph executed end to end. We lost roughly 1,100 user requests before I traced it to a single typo in a TypedDict field name — task_result written where the downstream node read result. LangGraph merged the state update, found no key collision, and threw the data into the void with zero complaint.
That bug is the entire reason this article exists. Production checkpointing for stateful LangGraph agents is sold as a solved problem in the docs: wire up a checkpointer, get durable execution, resume from failure. The reality is that the failure modes are quiet, the state model is unforgiving, and the gap between a tutorial graph and a graph that survives a week of real Telegram and WhatsApp traffic is three rewrites wide. Here is what each one taught me.
Rewrite one: the state schema that ate everything
The first version followed the canonical pattern. A TypedDict state, a few nodes, conditional edges. It worked in the notebook. It worked in the local test. It deployed to our Oracle Cloud VM and immediately started losing data — but only under concurrent load, which is exactly when you stop watching the logs line by line.
The root cause was how LangGraph reducers handle state merges. When a node returns a partial state dict, LangGraph merges it into the channel using the reducer you defined — or the default, which is overwrite. If two nodes write to the same key and you expected accumulation, you get the last writer. If a node writes a key that no other node reads, the value sits in state until it's overwritten and nobody notices it's orphaned.
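To make the merge rule concrete, here is a minimal stdlib-only sketch of "use the reducer if one is declared, otherwise overwrite." It is a toy model for illustration, not LangGraph's actual implementation; `PipelineState` and `merge_update` are hypothetical names.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class PipelineState(TypedDict):
    messages: Annotated[list, operator.add]  # reducer declared: values accumulate
    status: str                               # no reducer: last writer wins


def merge_update(schema: type, state: dict, update: dict) -> dict:
    """Toy merge: apply a key's reducer if it has one, otherwise overwrite."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated[...] types expose their extra arguments as __metadata__.
        metadata = getattr(hints.get(key), "__metadata__", ())
        reducer = next((m for m in metadata if callable(m)), None)
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["hi"], "status": "new"}
state = merge_update(PipelineState, state, {"messages": ["parsed"], "status": "parsed"})
state = merge_update(PipelineState, state, {"messages": ["saved"], "status": "complete"})
```

After both updates, `messages` holds all three entries while `status` only remembers the last write. That asymmetry is harmless when intended and lossy when it is not.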
My intake node returned {"task_result": parsed}. My persistence node read state["result"], found the key missing, and the default value was an empty list. Empty list, nothing to persist, return success. The graph reported completion. Python never raised because TypedDict is a typing fiction at runtime — it does not validate keys, it does not enforce shape, it does nothing but make your editor happy.
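The "typing fiction" point is easy to verify in isolation. The sketch below reproduces the shape of the bug with plain Python and no LangGraph; `IntakeState` is a hypothetical stand-in for the real state class.

```python
from typing import TypedDict


class IntakeState(TypedDict, total=False):
    result: list
    status: str


# A static type checker would flag the misspelled key; the interpreter does not.
state: IntakeState = {"status": "complete"}
state.update({"task_result": ["parsed job"]})  # typo: should be "result"

# The reader falls back to an empty default and "succeeds" with nothing to do.
to_persist = state.get("result", [])

# At runtime a TypedDict instance is just a dict, so nothing can object.
is_plain_dict = type(state) is dict
```

Nothing here raises. The misspelled key is stored, the correct key is absent, and the persistence step sees an empty list.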
What fixed it:
- Switched the state from TypedDict to a Pydantic model with extra="forbid". Now a write to an undeclared field raises at the node boundary instead of vanishing.
- Added explicit Annotated reducers for every channel that should accumulate. If I want messages to append, I say so: Annotated[list, operator.add]. Everything else overwrites by design, not by accident.
- Wrote a single assertion node at the end of every pipeline that checks the invariants: "if status is complete, result must be non-null." That one node would have caught the three-week bug on day one.
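A minimal sketch of the first and third fixes, assuming Pydantic v2. The field names mirror the ones used in this article, and `validate_update` is a hypothetical helper that checks a node's update against the schema directly, independent of how the graph framework itself applies validation.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class AgentState(BaseModel):
    # extra="forbid" turns an undeclared field into a validation error.
    model_config = ConfigDict(extra="forbid")

    status: str = "pending"
    result: Optional[str] = None


def validate_update(update: dict) -> AgentState:
    """Hypothetical helper: validate a node's update against the schema."""
    return AgentState(**update)


def check_invariants(state: AgentState) -> AgentState:
    """Final assertion step: a completed run must carry a result."""
    if state.status == "complete" and state.result is None:
        raise ValueError("status is 'complete' but result is None")
    return state


try:
    validate_update({"task_result": "parsed"})  # the original typo
    typo_rejected = False
except ValidationError:
    typo_rejected = True
```

With this in place the misspelled key fails loudly, and a run that claims completion without a result fails at the final check instead of reporting success.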
The lesson is blunt: LangGraph's state model gives you freedom and charges you for it. The default behavior is silent. You pay the bill in lost data unless you build the guardrails yourself.
Rewrite two: checkpoint corruption and the resume that wasn't
Once the state was honest, I turned on checkpointing. The promise is real and it matters: a long multi-step pipeline — say, an agent that routes a user request through Groq for cheap classification, then Claude for the heavy reasoning, then a tool call, then persistence — can crash halfway and resume from the last completed node instead of re-running everything. With Claude tokens costing what they cost, re-running a five-step pipeline from scratch on every transient failure is money you set on fire.
I started with the SQLite checkpointer because it's the default everyone reaches for. On a single VM that handles WhatsApp and Telegram agents, SQLite under concurrent writes is a trap. I hit database is locked errors within the first hundred concurrent threads. Worse, I hit a checkpoint that resumed into a half-written state — the checkpoint had been written mid-update when the process was killed by an OOM, and on resume the graph loaded a state where one channel reflected the new step and another reflected the old one.
That is checkpoint corruption, and it does not announce itself either. The graph resumes, runs the next node against inconsistent state, and produces a plausible-looking wrong answer. A user got a WhatsApp reply that referenced a request they never made, because the resumed thread had mixed two checkpoints.
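The lock contention is easy to reproduce with only the standard library. The sketch below uses two raw `sqlite3` connections, not LangGraph's SQLite checkpointer, and the `checkpoints` table is a hypothetical stand-in. It only shows the underlying behavior: one writer holds the write lock, and a second writer with no wait budget fails immediately.

```python
import os
import sqlite3
import tempfile


def demo_lock() -> str:
    """Return the error the second writer sees while the first holds the lock."""
    path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")
    # timeout=0: do not wait for the lock; isolation_level=None: manual transactions.
    writer_a = sqlite3.connect(path, timeout=0, isolation_level=None)
    writer_b = sqlite3.connect(path, timeout=0, isolation_level=None)
    try:
        writer_a.execute("CREATE TABLE checkpoints (thread_id TEXT, state TEXT)")
        writer_a.execute("BEGIN IMMEDIATE")  # take the database write lock
        writer_a.execute("INSERT INTO checkpoints VALUES ('t1', '{}')")
        try:
            writer_b.execute("BEGIN IMMEDIATE")  # second writer cannot get the lock
            writer_b.execute("ROLLBACK")
            return "no error"
        except sqlite3.OperationalError as exc:
            return str(exc)
    finally:
        writer_a.close()
        writer_b.close()


error_message = demo_lock()
```

A longer `timeout` makes the second writer wait rather than fail, which hides the problem at low concurrency and turns it into latency and eventual errors at high concurrency.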
What I changed:
- Moved off SQLite to Postgres on Oracle (the PostgresSaver). Concurrent writes stopped being a lottery. The database is locked errors went to zero because Postgres handles row-level locking instead of locking the whole file.
- Stopped trusting that a checkpoint write is atomic with the node side effect. It isn't. If your node calls an external API and then the checkpoint commits, a crash in between means you re-run the API call on resume. For non-idempotent calls (sending a message, charging something, posting to a channel), that's a duplicate. I made every side-effecting node idempotent with an explicit dedup key stored in state before the call.
- Added a checkpoint sanity check on resume: load the checkpoint, run the same invariant assertions, and if they fail, discard the checkpoint and restart the thread from the beginning rather than resuming into corruption.
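The resume-time check from the last bullet can be sketched as a plain function over the loaded state. This is a simplified illustration that treats the checkpoint as a dict. The second invariant and the names `INVARIANTS`, `failed_invariants`, and `state_for_resume` are assumptions for the example, not part of any library.

```python
from typing import Callable, Optional

Invariant = Callable[[dict], bool]

# Each invariant returns True when the state is consistent.
INVARIANTS: dict[str, Invariant] = {
    "complete implies result": lambda s: (
        s.get("status") != "complete" or s.get("result") is not None
    ),
    "reply implies result": lambda s: (
        s.get("reply_sent_id") is None or s.get("result") is not None
    ),
}


def failed_invariants(state: dict) -> list[str]:
    """Names of every invariant the state violates."""
    return [name for name, check in INVARIANTS.items() if not check(state)]


def state_for_resume(checkpoint: Optional[dict], initial: dict) -> dict:
    """Resume from the checkpoint only if it passes every invariant."""
    if checkpoint is None or failed_invariants(checkpoint):
        return dict(initial)  # discard the checkpoint and restart the thread
    return checkpoint


initial = {"status": "pending", "result": None}
torn = {"status": "complete", "result": None}  # mixed old and new channels
healthy = {"status": "complete", "result": "done"}
```

Here `state_for_resume(torn, initial)` falls back to the initial state, while `state_for_resume(healthy, initial)` resumes as usual.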
The hard number here: moving to Postgres and adding the resume-time invariant check dropped our "agent produced incoherent reply" incidents from roughly 8 per week to under 1. The remaining ones were prompt problems, not state problems.
The routing problem checkpointing made worse
There's a second-order effect nobody warns you about. We route between Groq and Claude based on the task — Groq's Llama models for classification, formatting, and anything where latency matters more than depth; Claude for reasoning that actually needs the quality. The routing decision lives in the graph state.
When you resume from a checkpoint, the routing decision is frozen in that checkpoint. If you've since changed your routing logic — say you moved a task class from Claude to Groq to cut cost — the resumed thread runs the old route, because the route is data in the checkpoint, not code re-evaluated on resume.
I learned this when a batch of resumed threads kept billing Claude after I'd explicitly moved their task class to Groq. The fix was to stop storing the resolved model in state and instead store only the task class, then resolve the model fresh in the node every time it runs. Decisions that depend on code you'll change should not be frozen into the checkpoint. Store the inputs to the decision, not the output.
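A small sketch of the difference, using hypothetical task classes and model labels. The routing table stands in for code that changes between deploys, and the two dicts stand in for checkpointed state.

```python
def resolve_model(task_class: str, routes: dict[str, str]) -> str:
    """Resolve the model at run time from the current routing table."""
    return routes[task_class]


# Routing table at the time the thread started, and after a later deploy.
routes_v1 = {"summarize": "claude", "classify": "groq"}
routes_v2 = {**routes_v1, "summarize": "groq"}  # task class moved to cut cost

# Checkpoint that froze the decision's output.
frozen_checkpoint = {"task_class": "summarize", "model": "claude"}

# Checkpoint that stores only the decision's input.
lean_checkpoint = {"task_class": "summarize"}

model_from_frozen = frozen_checkpoint["model"]
model_from_lean = resolve_model(lean_checkpoint["task_class"], routes_v2)
```

On resume after the deploy, the frozen checkpoint still points at the old model, while the lean one picks up the new route because the lookup runs against current code.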
This is a general rule for checkpointing stateful LangGraph agents in production: the checkpoint is a time capsule. Anything you put in it, you are committing to support across every code change until that thread completes or expires. Keep it minimal. Store the durable facts, recompute the derivable ones.
Rewrite three: the pattern that finally held
The third rewrite wasn't a new framework or a clever trick. It was a discipline I should have started with. I split every pipeline into nodes that obey three rules:
1. Every node is a pure function of state plus one external effect, at most. No node both calls Claude and writes to Postgres and sends a Telegram message. One side effect per node. This means more nodes, but each one is independently resumable and independently idempotent. When something fails, I know exactly which effect to make safe.
2. Side effects come after the dedup check, not before. Each side-effecting node first checks state for a dedup key. If the key is present, it skips the effect and returns. The key gets written to state in the same update as the effect's result. On resume, the dedup key tells me the effect already happened.
3. The state schema is validated on entry and exit of the graph, and on every checkpoint resume. Pydantic model, extra="forbid", explicit reducers, an invariant assertion node. The graph cannot run with a malformed state — it raises loudly at the boundary.
Here's the shape of a node under this pattern:
```python
def send_reply_node(state: AgentState) -> dict:
    if state.reply_sent_id is not None:
        # already sent on a prior run before crash; skip
        return {}
    msg_id = telegram.send(state.chat_id, state.result)
    return {"reply_sent_id": msg_id}
```
It's not elegant. It's defensive. But it survived the load that broke the first two versions. Across our WhatsApp and Telegram agents on Oracle, pipeline completion rate went from "I don't actually know" — because the silent-discard bug meant my metrics were lying — to a measured 99.3%, with the remaining 0.7% being upstream API timeouts that resume cleanly.
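To see the dedup check doing its job, here is a self-contained harness built around the same node shape. `FakeTelegram` is a hypothetical stand-in that counts sends, the state is a frozen dataclass rather than the real Pydantic model, and the replay is simulated by calling the node a second time with the merged state.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass
class FakeTelegram:
    """Stand-in for a messaging client; records every send."""
    sent: list = field(default_factory=list)

    def send(self, chat_id: int, text: str) -> int:
        self.sent.append((chat_id, text))
        return len(self.sent)  # pretend message id


@dataclass(frozen=True)
class AgentState:
    chat_id: int
    result: str
    reply_sent_id: Optional[int] = None


def send_reply_node(state: AgentState, telegram: FakeTelegram) -> dict:
    if state.reply_sent_id is not None:
        return {}  # effect already recorded in state; skip it
    return {"reply_sent_id": telegram.send(state.chat_id, state.result)}


telegram = FakeTelegram()
state = AgentState(chat_id=42, result="done")
state = replace(state, **send_reply_node(state, telegram))  # first run
state = replace(state, **send_reply_node(state, telegram))  # replay after resume
```

Only one message is sent across both runs. Note what this does not cover: if the process dies after the send but before the update is checkpointed, the key never reaches state and the replay sends again. Closing that window would need something outside the graph, such as an idempotency key the receiving API honors.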
What I'd tell someone starting today
Don't reach for the SQLite checkpointer beyond your laptop. It will work in every test and fail under the exact concurrency that production brings. Go straight to Postgres if you have any real traffic — and on Oracle's free tier you can run a perfectly adequate Postgres instance for this, which is what we do at zero infra cost.
Don't use TypedDict for state if you care about not losing data. It validates nothing. Use a Pydantic model with extra="forbid" so undeclared fields are rejected. The performance cost is irrelevant next to a three-week silent failure.
Don't trust that "durable execution" means your side effects are safe. The checkpoint and the side effect are not transactionally linked. Make every external call idempotent with a key you store in state, or accept that resume means duplicates.
And don't store derived decisions in the checkpoint. The route you picked, the model you chose, the timestamp-based branch — store the inputs, recompute the output. Your code will change before that thread expires, and the checkpoint will quietly serve you stale logic.
LangGraph is genuinely good once you've internalized that it does very little for you by default and charges you silently when you assume otherwise. The graph abstraction is sound. The state and checkpoint machinery is sharp on every edge. Three rewrites was the tax. I'm writing this so yours is one.
Frequently Asked Questions
Q: Why not just use the MemorySaver or SQLite checkpointer in production if traffic is moderate?
A: SQLite serializes writes through a file lock, and under concurrent threads you hit database is locked somewhere between 50 and a few hundred simultaneous writes depending on your disk. MemorySaver loses everything on restart, defeating the point of checkpointing. PostgresSaver on a free-tier Oracle instance costs nothing extra and removes the entire class of lock errors — there's no reason to gamble.
Q: How do you handle a checkpoint written mid-update when a process is OOM-killed?
A: Don't assume the checkpoint and your node's side effect committed together — they didn't. On resume, re-run your state invariant assertions before continuing; if they fail, discard the checkpoint and restart the thread rather than resuming into inconsistent state. This dropped our incoherent-reply incidents from ~8/week to under 1.
Q: Does Pydantic state validation add meaningful latency per node?
A: For a state object with a few dozen fields, validation is sub-millisecond — utterly dwarfed by any LLM call, which is 200ms on Groq and several seconds on Claude. The "performance" objection to strict validation is theoretical. The cost of not validating is data that vanishes silently for weeks.
Q: Why store task class instead of the resolved model in checkpoint state?
A: Checkpoints are time capsules — whatever you freeze in them runs against your old logic on resume, even after you change the code. We had resumed threads billing Claude after we'd rerouted their task class to Groq, because the resolved model was frozen in the checkpoint. Store the decision inputs, recompute the decision in the node.
Q: One side effect per node sounds like it explodes the node count. Is it worth it?
A: It roughly doubled our node count on complex pipelines, yes. But each node became independently resumable and idempotent, which is what took completion rate to 99.3%. A node that calls an LLM, writes to a DB, and sends a message has three places to fail and no clean resume point — splitting them is what makes checkpointing actually deliver durable execution instead of duplicate side effects.