I spent six weeks building a web app for Spanish learning. It had a clean React frontend, a progress dashboard, spaced-repetition logic, the works. It got 100 free signups in the first two weeks and a 4% return rate by week three. Then I rebuilt the same thing as a WhatsApp agent in nine days, charged $9/month, and three people paid me. Those three taught me more about retention, memory, and what "AI language learning" actually means than the entire web app cohort combined.
This is the architecture of EspaLuz — a bilingual Spanish/English tutor that lives in WhatsApp and Telegram — and specifically how I built conversation memory that survives across sessions without paying for a managed vector store. If you're a founder deciding between a web app and a messaging agent, or a developer who keeps reaching for Pinecone the second someone says "memory," read the cost numbers before you commit.
The web app was a graveyard with good lighting
The dashboard was the problem, not the feature. Every web app for learning a language asks the user to do the same thing: leave their life, open a tab, log in, and "study." That's a context switch, and context switches are where retention dies. My 100 signups didn't churn because the lessons were bad. They churned because opening a dedicated tab to practice Spanish competes with email, Slack, and the actual reason they came to Panama — which is to live here, not to study.
WhatsApp doesn't ask for a context switch. It's already open. In Latin America it's not an app, it's the substrate — your landlord texts you on it, your kid's school sends homework on it, the guy fixing your AC confirms on it. When EspaLuz lives in that same thread, practicing Spanish isn't a separate activity. It's one more conversation in a list of conversations the user already checks 40 times a day.
The hard number: web app week-three return rate was 4%. WhatsApp agent week-three return rate, on a tiny paying cohort, was 100% — three of three. Small sample, yes. But the difference between 4% and 100% isn't a sample-size artifact. It's a delivery-channel artifact.
Two-layer memory without a paid vector store
Here's the part most developers get wrong. The moment someone says "the AI needs to remember the conversation," everyone reaches for embeddings and a vector database. For a language tutor, that's overkill that costs you money and adds latency for retrieval you mostly don't need.
EspaLuz uses two memory layers, and neither requires a managed vector store:
Layer 1 — Rolling conversation window. The last N messages of the actual dialogue, stored as plain rows in Postgres on Oracle Cloud. I keep roughly the last 20 turns per user. This is the working memory: what we just talked about, the verb tense the user struggled with three messages ago, the joke they made. No embeddings. Just a SELECT ... WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20, reversed, and dropped into the prompt.
Layer 2 — Structured learner profile. A summarized, slowly-changing record of who this person is as a learner. Their level (A2, working toward B1), recurring mistakes (confuses ser and estar, drops the personal a), topics they care about (their kid's school, grocery vocabulary, talking to the landlord), and their goal. This isn't raw conversation — it's a distilled JSON blob, also a single row in Postgres, updated every few sessions by a cheap summarization pass.
When a message comes in, the prompt assembles both layers: profile blob plus rolling window plus the new message. That's the entire memory system. Total managed-vector-store cost: $0. Total infra cost beyond the Oracle instance I'm already running: $0.
When would you actually need embeddings?
I told you I'd give an answer instead of hedging, so here it is. You need a vector store when retrieval is over a large, mostly-static corpus the user didn't write — a documentation set, a legal library, a 400-page textbook. You're searching content.
A language tutor is the opposite case. You're not searching a corpus. You're maintaining state about one person across a bounded number of turns. The relevant context is almost always recent (the rolling window) or structurally stable (the profile). Semantic search over 18 months of someone's Spanish practice sounds powerful and is, in practice, a way to surface a random message from March when what matters is the mistake they made 90 seconds ago.
If EspaLuz grows to where a learner has thousands of sessions and I want "remember when we talked about ordering at the fish market three months ago," I'll add a third layer with embeddings — scoped to long-term episodic recall only, and only summaries, not raw turns. That's a feature for month 12, not the MVP. Building it on day one would have been money and latency spent solving a problem three paying users never had.
Conversation continuity across sessions is the whole product
The single thing that made paying users stay: EspaLuz remembered. Not in a marketing sense — in the literal sense that when a user came back two days later, the agent referenced what they'd been working on. "Last time you mixed up por and para when talking about your apartment. Let's try that again."
That's the profile layer plus the rolling window doing their job. The web app technically had this data too — it was in the progress dashboard. But nobody reads a dashboard. The difference is that in a conversation, continuity is delivered to you inside the dialogue, not parked behind a click. The AI language learning experience on WhatsApp works because conversation memory shows up as conversation, not as a chart.
The technical trap here is session boundaries. WhatsApp gives you no native session concept — every message is just a webhook. So "session" is something you define. I treat a gap of more than a few hours as a new session and trigger a lightweight re-greeting that pulls from the profile. That re-greeting is what makes the user feel remembered. It costs one extra cheap LLM call on the first message of a returning session.
Model routing: Groq for speed, Claude for the hard turns
EspaLuz runs on a routing layer, not a single model. Most messages — a vocabulary correction, a quick translation, a "how do I say X" — go to Groq-hosted Llama for sub-second responses. In a messaging context, latency is felt brutally. A web app user waits for a page; a WhatsApp user watching the "typing…" indicator for four seconds assumes the thing broke.
The harder turns route to Claude: when the user writes a paragraph and wants nuanced feedback, when the profile needs updating with a real summarization pass, or when the conversation gets emotionally loaded (a frustrated learner about to quit). Claude handles those better and the cost is justified because they're a minority of messages.
The routing decision is a cheap classifier on the inbound message plus some heuristics — message length, whether the user is asking for correction versus chatting, recent error density. Roughly 80% of traffic hits Groq, 20% hits Claude. That split keeps my per-user monthly model cost well under $1, which is what makes a $9/month price actually have margin after Twilio's WhatsApp messaging fees, which are the real cost line you should be watching — not the LLM.
What three paying users taught me that 100 free signups couldn't
Free signups give you vanity. They tell you a headline worked. They tell you nothing about whether the product survives contact with someone's actual life, because the user has no skin in it and no reason to come back.
Three paying users gave me specifics I could build on:
- One was a retiree who needed to talk to her doctor. Her entire vocabulary need was medical and bureaucratic. The free cohort never revealed this because nobody told me what they actually needed Spanish for. She told me because she'd paid and expected it to work.
- One kept switching to English mid-sentence and wanted the agent to gently push back. That became a profile flag — "tolerance for immersion: low" — which changes how aggressively EspaLuz responds in Spanish versus bilingual.
- One never used full sentences, just fired single words, and got frustrated when the agent over-explained. That taught me response length has to adapt to the user's own message length, which is now a routing input.
None of those three insights existed in my 100 free signups. Free users churn silently. Paying users complain, and complaints are the cheapest, highest-resolution product feedback you will ever get. Charge early — not to make money on three people, but because the price tag is what turns a tourist into a teacher.
What I'd tell a founder choosing the channel
If your users are already in a messaging app for the reason your product addresses, build the agent there. Don't make them come to you. A web app is the right call when the experience genuinely needs a screen — a code editor, a design canvas, a data dashboard you interact with. A language tutor, a coach, a reminder agent, anything that is fundamentally a conversation — that belongs in the thread the user already lives in.
And build the cheapest memory that solves your actual retrieval pattern. For conversational state about one person over a bounded history, two Postgres rows beat a vector store on cost, latency, and complexity. Add embeddings when you can name the specific query they'd answer that your rolling window and profile can't. If you can't name it, you don't need it yet.
Frequently Asked Questions
Q: Why not just use the LLM's native context window instead of a separate Postgres memory layer?
A: Because the context window is per-request and stateless across the webhook calls WhatsApp sends you. Each message is an independent HTTP event with no memory of the last one — you have to rehydrate state yourself every single time. Postgres is where that state lives between requests; the context window is just where you assemble it for one call.
Q: What does the Twilio/WhatsApp messaging cost actually look like at small scale?
A: That's the line item that matters more than your LLM bill at this size. Business-initiated conversation fees and per-message pricing vary by country, but for a handful of active users it ran me more than the Groq inference did. Price your subscription against messaging fees first; the model cost is the easy part to keep under a dollar per user.
Q: How do you keep the profile summary from drifting or hallucinating facts about the learner?
A: I don't summarize on every message — I run the profile update every few sessions, and the summarization prompt is constrained to update specific fields (level, recurring errors, topics, immersion tolerance) rather than rewrite freeform. Constraining the output schema kills most drift. I also keep the raw rolling window so the agent never relies on the summary alone for recent context.
Q: Three paying users isn't a sample. Why trust the retention number?
A: I don't trust the 100% as a number — I trust the mechanism behind it. The web app churned because it demanded a context switch; the WhatsApp agent didn't, and that's a structural difference, not a statistical one. The three users validated the mechanism, not the percentage. I'd expect retention to fall as the cohort grows, but the channel advantage is real regardless.
Q: Why route between Groq and Claude instead of just using one model for consistency?
A: Latency and cost have different shapes than quality. Most messages are short corrections where Groq's sub-second response matters more than Claude's nuance, and they're 80% of traffic. The 20% that need real reasoning or careful emotional handling justify Claude's cost. Using one model means either overpaying on easy turns or under-serving hard ones — routing lets you optimize each.