For three weeks our onboarding agent silently discarded every job it was supposed to persist, and the logs showed nothing. No exception, no warning, no failed write. The checkpointer reported success on every node. The state just wasn't there on resume. I rewrote the same LangGraph pipeline three times before I understood that the framework was doing exactly what I told it to — which was the wrong thing, quietly, at scale.
This is the post I wanted when I was staring at a checkpoint table full of rows that all looked correct and a system that behaved as if none of them existed. If you are building stateful LangGraph agents, checkpointing them in production, and assuming the defaults will save you, they will not. Here is what broke, why, and the one pattern that finally held.
The state schema mismatch that ate three weeks
We run multi-agent systems on Oracle Cloud: Telegram and WhatsApp front ends, plus a router that splits work between Groq for cheap fast turns and Claude for the reasoning-heavy ones. The agent in question handled a five-step onboarding flow: collect intent, classify, enrich, draft, confirm. Each step wrote to a shared graph state. We checkpointed into Postgres so that a user could close WhatsApp, come back in two days, and resume mid-flow.
The bug: every resume started from scratch. New users, returning users, everyone got step one again.
The cause was my state schema. In LangGraph your state is a TypedDict (or Pydantic model) and every node returns a partial update that gets merged. I had defined something like:
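from typing import TypedDict

class OnboardingState(TypedDict):
    messages: list  # no reducer: each update replaces the whole list
    profile: dict   # no reducer: each update replaces the whole dict
    step: str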
The problem is the default merge behavior. LangGraph merges channel updates using reducers. If you do not specify a reducer, the default for a field is overwrite: the value a node returns replaces the old one. That is fine for step. It is catastrophic for messages, where I assumed appending. So each node returning {"messages": [new_msg]} overwrote the entire history with a single message. The checkpoint saved that single-message state perfectly. The profile field failed the same way: on resume, the graph loaded a profile that one node had effectively reset, because that node returned only a partial profile and the overwrite discarded everything the earlier steps had collected.
Nothing errored because every write was technically valid. The schema accepted a dict. It accepted a list. It just kept the wrong one.
The fix is Annotated fields with explicit reducers:
from operator import add
from typing import Annotated, TypedDict

def merge_profile(existing: dict, update: dict) -> dict:
    # Shallow merge: keys in the update win, all other keys are kept.
    return {**existing, **update}

class OnboardingState(TypedDict):
    messages: Annotated[list, add]           # append: old list + new list
    profile: Annotated[dict, merge_profile]  # merge partial updates
    step: str  # overwrite is correct here
Once I made the merge semantics explicit per field, the silent discard stopped. The lesson is brutal in its simplicity: in LangGraph, state behavior is a property of the schema, not the node logic. If you reason about your data flow by reading node functions, you will be wrong. The schema is the contract.
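To see the difference without running a graph, here is a minimal plain-Python sketch of the per-field merge. apply_update is a hypothetical stand-in for what happens when a node returns a partial update; it is not LangGraph's implementation, only a model of the semantics described above, and the sample state values are made up.

```python
from operator import add
from typing import Any, Callable

Reducer = Callable[[Any, Any], Any]


def merge_profile(existing: dict, update: dict) -> dict:
    return {**existing, **update}


def apply_update(state: dict, update: dict, reducers: dict[str, Reducer]) -> dict:
    """Hypothetical model of a channel merge: use the reducer if declared, else overwrite."""
    merged = dict(state)
    for key, value in update.items():
        if key in reducers and key in merged:
            merged[key] = reducers[key](merged[key], value)
        else:
            merged[key] = value  # default: last write wins
    return merged


state = {"messages": ["hi"], "profile": {"name": "Ana"}, "step": "collect"}
update = {"messages": ["intent: signup"], "profile": {"plan": "pro"}, "step": "classify"}

# No reducers declared: every field in the update replaces the old value.
lossy = apply_update(state, update, reducers={})

# Explicit reducers for the two accumulating fields; step still overwrites.
safe = apply_update(state, update, reducers={"messages": add, "profile": merge_profile})
```

In the lossy result the history is a single message and the profile has lost its name; in the safe result both survive. Neither call raises, which is exactly why the bug was silent.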
Three weeks lost because I never asked what "merge" meant. I assumed it meant append. It meant replace.
Checkpoint corruption and the resume that wouldn't
Rewrite number two. Schema fixed, flow stable in testing. Then under real WhatsApp load we started seeing resumes fail with deserialization errors. The relevant one, paraphrased from our Oracle Postgres logs:
TypeError: Object of type AIMessage is not JSON serializable
and intermittently:
psycopg.errors.UniqueViolation: duplicate key value violates unique constraint "checkpoints_pkey"
Two separate problems wearing the same costume of "resume broken."
The serialization error came from storing LangChain message objects directly. The default JsonPlusSerializer handles most LangChain types, but we had a custom message subclass carrying routing metadata (which model handled it, Groq or Claude, token count, latency). That custom field broke the serializer. The checkpoint write would partially succeed, leaving a row with a corrupted blob. Reading it back threw on deserialize. So we had checkpoint rows that existed, passed a row-count check, and were unusable.
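The failure is easy to reproduce in miniature. The sketch below uses the standard json module as a stand-in for the checkpoint serializer (an assumption for illustration only; JsonPlusSerializer handles far more types than json.dumps does), with a hypothetical RoutedMessage class playing the role of a custom message type. It also shows one plain-dict shape that carries the same routing information and serializes cleanly.

```python
import json
from dataclasses import dataclass


@dataclass
class RoutedMessage:
    """Hypothetical custom message type carrying routing metadata."""
    content: str
    model: str
    tokens: int
    latency_ms: int


def is_serializable(value) -> bool:
    """Return True if value survives a JSON round trip unchanged."""
    try:
        return json.loads(json.dumps(value)) == value
    except (TypeError, ValueError):
        return False


# Custom object stored directly in state: json.dumps raises TypeError.
custom = {"messages": [RoutedMessage("hello", "groq", 12, 180)]}

# Same information, with routing metadata in a plain dict keyed by message index.
plain = {
    "messages": [{"role": "ai", "content": "hello"}],
    "routing": {"0": {"model": "groq", "tokens": 12, "latency_ms": 180}},
}
```

A round-trip check like is_serializable, run in a unit test over representative states, is a cheap way to catch this class of problem before it reaches the checkpoint table.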
The duplicate key violation was worse and more interesting. LangGraph's checkpointer keys on thread_id plus checkpoint id plus namespace. We were generating thread_id from the WhatsApp phone number. Fine — until two messages from the same user arrived within the same execution window because the webhook retried. WhatsApp retries webhooks aggressively if you do not return 200 fast enough, and our enrichment step against Claude sometimes took four to six seconds. So the same message triggered two concurrent graph runs writing to the same thread. Both tried to write checkpoint id 0. One won, one threw, and the loser's partial state sometimes landed first.
The fixes, in order of how much they mattered:
1. Strip custom objects before checkpointing. Routing metadata moved out of the message object into a plain dict field in state. Messages stayed standard LangChain types. The serializer stopped choking. Cost: a small refactor. Benefit: zero corrupt rows since.
2. Webhook idempotency before the graph. We return 200 immediately, enqueue the message with its WhatsApp message id as a dedup key, and process from the queue. Duplicate webhooks get dropped before they ever reach LangGraph. This killed the concurrent-write race outright. The graph should never be your concurrency control layer — it is not built for it.
3. One run per thread at a time. We added a lightweight advisory lock in Postgres keyed on thread_id. If a run is in progress for a thread, the next message waits. For a conversational agent this is correct behavior anyway — you do not want two replies racing.
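Fixes 2 and 3 can be sketched together. Everything named here is hypothetical (Ingest, advisory_lock_key, the sample ids), the dedup set is in memory where a real deployment would need a durable store, and the lock is shown only as a key derivation plus the SQL a worker would run. The derivation exists because Postgres advisory lock functions take a bigint key, not a string. Using the transaction-scoped variant is my assumption, not something the fix above depends on.

```python
import hashlib
from collections import deque


class Ingest:
    """Hypothetical webhook intake: acknowledge fast, drop duplicate deliveries."""

    def __init__(self) -> None:
        self.seen: set[str] = set()  # in production: a durable store with expiry
        self.queue: deque = deque()

    def receive(self, message_id: str, thread_id: str, text: str) -> int:
        # Always answer 200 so the provider stops retrying; enqueue only new ids.
        if message_id not in self.seen:
            self.seen.add(message_id)
            self.queue.append((thread_id, text))
        return 200


def advisory_lock_key(thread_id: str) -> int:
    """Map a thread_id to a signed 64-bit integer usable as an advisory lock key."""
    digest = hashlib.blake2b(thread_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


# Executed inside the worker's transaction before invoking the graph for a thread.
# pg_advisory_xact_lock waits until the lock is free and releases it at commit or rollback.
LOCK_SQL = "SELECT pg_advisory_xact_lock(%s)"

ingest = Ingest()
ingest.receive("wamid.1", "+15550100", "hi")
ingest.receive("wamid.1", "+15550100", "hi")  # webhook retry: acknowledged, not enqueued
key = advisory_lock_key("+15550100")
```

The ordering matters: dedup removes identical deliveries, and the lock then serializes distinct messages that happen to arrive for the same thread at the same time.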
The pattern that finally made multi-step pipelines stable
Rewrite three was where it clicked, and the clicking was mostly about deleting things.
The pattern: small graphs, explicit checkpoints at boundaries, and idempotent nodes. Every word there earned its place.
Small graphs. My first design was one mega-graph with twelve nodes and conditional edges everywhere. Debugging it meant tracing state mutations across the whole thing. I broke it into three subgraphs — intake, processing, confirmation — each checkpointed independently. A failure in processing no longer rolls back intake. The blast radius of any bug shrank to one subgraph.
Explicit checkpoints at boundaries. With a checkpointer attached, LangGraph checkpoints after every step of the graph by default, which in a linear flow means after every node. That sounds safe and is mostly wasteful. We checkpoint at meaningful resume points: after intake completes, after each expensive model call, before the human confirmation wait. Between those, in-memory is fine. This cut our Postgres write volume by roughly 60% and, more importantly, made the checkpoints mean something. A checkpoint should represent a state a user might actually resume from, not an internal step that has no resume semantics.
Idempotent nodes. This is the one that matters most and the one nobody tells you. Because webhooks retry, because runs get interrupted, because Oracle reschedules a container occasionally, any node can run twice on the same state. So every node must produce the same result if run twice with the same input. The Claude enrichment node, for example, checks whether enrichment already exists in state before calling the model. If profile["enriched_at"] is set, it returns early. That one check saved real money — Claude calls are our largest variable cost, and before idempotency, retries were silently doubling them.
Here is the skeleton of the stable version:
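from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

# OnboardingState is the schema with explicit reducers from rewrite one.
# call_claude, now, and pool (a Postgres connection pool) are defined elsewhere.

def enrich(state: OnboardingState) -> dict:
    if state["profile"].get("enriched_at"):
        return {}  # idempotent: already done, no-op
    result = call_claude(state["profile"])
    # Partial update: the merge_profile reducer folds it into the existing profile.
    return {"profile": {**result, "enriched_at": now()}}

graph = StateGraph(OnboardingState)
graph.add_node("enrich", enrich)
# ... edges ...

checkpointer = PostgresSaver(pool)
checkpointer.setup()  # creates the checkpoint tables on first run
app = graph.compile(checkpointer=checkpointer)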
The return {} from an already-completed node is the whole trick. Combined with explicit reducers from rewrite one, it means a re-run never corrupts state and never double-charges.
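A quick way to check the claim is to run the node twice against a stub. The sketch below is self-contained and does not import LangGraph: fake_claude and the call counter are hypothetical stand-ins for the paid model call, and run_node merges the partial update by hand the way the merge_profile reducer would.

```python
from datetime import datetime, timezone

calls = {"count": 0}


def fake_claude(profile: dict) -> dict:
    """Hypothetical stand-in for the paid model call; counts invocations."""
    calls["count"] += 1
    return {"segment": "smb"}


def merge_profile(existing: dict, update: dict) -> dict:
    return {**existing, **update}


def enrich(state: dict) -> dict:
    if state["profile"].get("enriched_at"):
        return {}  # already enriched: no update, no model call
    result = fake_claude(state["profile"])
    stamp = datetime.now(timezone.utc).isoformat()
    return {"profile": {**result, "enriched_at": stamp}}


def run_node(state: dict) -> dict:
    """Apply enrich and merge its partial update as the reducer would."""
    update = enrich(state)
    if "profile" in update:
        state = {**state, "profile": merge_profile(state["profile"], update["profile"])}
    return state


state = {"profile": {"name": "Ana"}}
first = run_node(state)
second = run_node(first)  # simulated retry on the same state
```

After both runs the stub has been called once and the state is unchanged by the retry, which makes a useful regression test for any node that costs money to execute.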
What I would tell my three-weeks-ago self
LangGraph is a good library being used by people who assume it behaves like their mental model of a state machine. It does not. It behaves like a set of merge operations over channels, persisted by a serializer, keyed by a thread id. Every one of my bugs lived in the gap between those two descriptions.
Concretely:
- Read the reducer semantics before you write a single node. Assume nothing about merge behavior. The default is overwrite, and it will silently eat your data.
- Do not store custom objects in state if you want them checkpointed. Plain dicts and standard LangChain types serialize cleanly. Everything else is a future corruption.
- Put concurrency control outside the graph. Idempotent webhook ingestion plus a thread-level lock. The graph is execution, not coordination.
- Checkpoint at resume boundaries, not at every node. Fewer writes, clearer semantics, lower Postgres load.
- Make every node idempotent. It is the difference between a retry being free and a retry costing you a Claude call and a corrupt row.
We run this in production now across both messaging channels with zero VC money, on Oracle's free-then-cheap tier, routing the cheap turns to Groq and reserving Claude for the steps that justify the cost. The system survives webhook storms, container reschedules, and users who vanish for a week and come back mid-flow. It did not get there by being clever. It got there by being explicit about the three things LangGraph does implicitly and wrong-by-default for my use case.
The framework did not fail me. My assumptions did. The schema was always the contract — I just had not read it.
Frequently Asked Questions
Q: Is the default overwrite reducer ever the right choice for list fields?
A: Yes — for fields like current_options or last_tool_calls where each node fully replaces the value, overwrite is correct and add would accumulate stale entries. The rule is: append when the field is a log or history, overwrite when it is a snapshot. The mistake is not picking overwrite, it is never deciding and letting the default decide for you.
Q: Why PostgresSaver on Oracle instead of the SQLite or in-memory checkpointer?
A: In-memory loses everything on container restart, and our Oracle containers do get rescheduled. SQLite does not survive multiple processes writing concurrently, which we have under load. PostgresSaver with a connection pool handles concurrent threads and survives restarts; the cost is one managed Postgres instance, which on Oracle's tier runs us effectively nothing at our volume.
Q: How do you handle a node that genuinely can't be made idempotent, like sending a message?
A: Move the side effect's dedup key into state and check it. Before sending a WhatsApp reply, we write the intended outbound_message_id to state, then send, then mark it sent. On re-run, if sent is true we skip. The send itself is not idempotent, but the decision to send is, which is enough.
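A minimal sketch of that guard, assuming a hypothetical send_whatsapp callable and a hypothetical outbound key in state. For brevity it folds the write-id and mark-sent steps into a single returned update, so it illustrates the skip-on-re-run check rather than the full three-step sequence.

```python
import uuid
from typing import Callable


def send_once(state: dict, text: str, send_whatsapp: Callable[[str, str], None]) -> dict:
    """Return the state update for a guarded send; skip if already marked sent."""
    outbound = state.get("outbound", {})
    if outbound.get("sent"):
        return {}  # re-run after a completed send: do nothing

    # Reuse a recorded id on retry so the same logical message keeps one key.
    message_id = outbound.get("outbound_message_id") or str(uuid.uuid4())
    send_whatsapp(message_id, text)
    return {"outbound": {"outbound_message_id": message_id, "sent": True}}


delivered = []


def fake_send(message_id: str, body: str) -> None:
    """Hypothetical sender that records what would have gone out."""
    delivered.append((message_id, body))


state: dict = {}
state.update(send_once(state, "Welcome!", fake_send))
state.update(send_once(state, "Welcome!", fake_send))  # simulated re-run
```

The second call returns an empty update and sends nothing. A crash between the send and the checkpoint write can still produce one duplicate, so this narrows the window rather than closing it.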
Q: Did breaking one graph into three subgraphs hurt latency?
A: Marginally — about 40–80ms extra per boundary from the additional checkpoint write, which is noise next to a 4-second Claude call. The debugging and failure-isolation gains paid for it many times over. If your nodes are all sub-100ms and you have no expensive external calls, the tradeoff shifts and a single graph may be fine.
Q: How did you detect the silent state discard given there were no errors?
A: We added a structured log line at the start of every node dumping the keys present in state and the message count. The discard showed up instantly: message count was always 1 on entry to step two, never growing. Without that observability we would still be guessing. Log your state shape, not just your control flow.
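A log line of that kind might look like the sketch below. The logger name and field names are assumptions, and the function deliberately records the shape of the state (keys and counts) rather than its contents, so user messages stay out of the logs.

```python
import json
import logging

logger = logging.getLogger("onboarding.state")


def state_shape(node: str, state: dict) -> dict:
    """Summarize state for logging without including message contents."""
    return {
        "node": node,
        "keys": sorted(state.keys()),
        "message_count": len(state.get("messages", [])),
        "step": state.get("step"),
    }


def log_state_shape(node: str, state: dict) -> None:
    """Emit one structured line; call this first thing inside each node."""
    logger.info(json.dumps(state_shape(node, state)))


sample = {"messages": ["only one"], "profile": {}, "step": "classify"}
shape = state_shape("classify", sample)
log_state_shape("classify", sample)
```

A message_count that stays at 1 from node to node is the signature of a list field that is being overwritten instead of appended to.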