I shut down the web app after 100 free signups produced zero conversations longer than four messages. The WhatsApp version had three paying users inside two weeks, and one of them had a 40-message thread spanning six days. That gap — not the signup count, the conversation depth — is the whole story of EspaLuz, the Spanish-learning agent I built for my own kid first and then for strangers.
The lesson nobody tells you: the channel is part of the architecture. You can build the cleverest memory layer in the world, but if the user has to remember to open a tab, log in, and "start a session", your retention curve is already dead. WhatsApp solved a problem my code couldn't.
The web app was technically fine and practically useless
The first version of EspaLuz was a React frontend talking to a FastAPI backend on Oracle Cloud (Always Free tier — two AMD VMs, no GPU, which matters later). Auth via magic link. Conversation history in Postgres. Claude for the tutoring responses, with a system prompt that adapted difficulty based on the learner's level.
It worked. In demos it looked great. The error logs were clean. And it died anyway.
Here is what the analytics showed after a month:
- 102 signups
- 71 completed at least one message
- 12 returned a second day
- 0 reached day seven
The web app forced a context switch. Learning a language is a habit you wedge into the cracks of a day — waiting for coffee, on a bus, in the three minutes before a meeting. Nobody opens a browser tab and logs into a learning portal in those cracks. They open WhatsApp, because WhatsApp is already open.
I had built a product that competed for the user's "focused study time" budget. That budget is roughly zero for most adults. The channel was wrong, and no amount of prompt engineering fixes the wrong channel.
Why WhatsApp changed the retention math
I rebuilt EspaLuz as a WhatsApp agent. Same Claude tutoring core, same Oracle backend, but the interface was now a thread the user already lived in. The behavioral difference was immediate and not subtle.
The first paying user — a Panamanian dad trying to keep up with his bilingual daughter — sent messages at 6:40 AM, at lunch, and again at 11 PM. No session. No login. He treated EspaLuz like a person who was always available, because in WhatsApp that's exactly what an agent feels like.
This is the part that's invisible until you ship it: AI language learning over WhatsApp conversation memory is a fundamentally different product than the same logic in a web app. Not because the model is smarter. Because the contract with the user changed. The thread is permanent. The user expects you to remember what they said yesterday, the way a tutor would. If you don't, the illusion collapses and they stop paying.
So the channel forced a memory requirement that the web app let me get lazy about. In the web app, every session felt like a fresh start because the UI looked like a fresh start. In WhatsApp, a fresh start is a betrayal.
Two-layer memory without paying for a vector store
Everyone reaches for Pinecone or Weaviate the moment "conversation memory" comes up. I didn't, and EspaLuz runs fine. Here's why, and here's the architecture.
The instinct is: embed every message, store the vectors, do semantic search on every turn to retrieve "relevant" past context. For a tutoring agent that's mostly overkill, and it adds latency and a monthly bill for capability I wasn't using. A managed vector store would have cost me real money per month for a three-user product. On a bootstrapped budget that's not a rounding error.
Instead, two layers:
Layer 1 — rolling working memory. The last N turns of the actual conversation, kept verbatim in Postgres and injected into the prompt. For EspaLuz, N is around 12–16 turns depending on token budget. This is the cheap, high-fidelity short-term memory. It handles "what were we just talking about" without any embeddings at all.
Layer 2 — structured learner state. Not raw messages — distilled facts. After each session (or every few turns), a cheap model call extracts and updates a small JSON profile: current level, recurring mistakes, vocabulary the user struggled with, topics they care about (the dad's profile literally has "wants to talk to his daughter about dinosaurs"). This profile is small — a few hundred tokens — and it's injected into every prompt.
learner_profile = {
"level": "A2-B1",
"recurring_errors": ["ser vs estar", "preterite vs imperfect"],
"weak_vocab": ["scientific terms", "kitchen verbs"],
"interests": ["daughter's school", "dinosaurs", "cooking"],
"last_topic": "ordering food at a restaurant"
}
So the prompt for each turn is: system instructions + learner_profile (long-term) + last 12–16 turns (short-term) + current message. That's it. No vector search. No similarity threshold tuning. No vector DB bill.
Does this miss the case where the user references something from three weeks ago that's not in the profile and not in the recent window? Yes. It happens maybe once every few hundred turns, and when it does, the agent gracefully asks for a reminder — which is what a human tutor does too. The tradeoff is heavily in favor of the simple architecture for this use case. I'd revisit it past a few thousand active users, not before.
The distillation step is the actual cleverness, not the storage. Raw chat logs are a terrible long-term memory — they're noisy, they blow up your token budget, and most of the content is irrelevant on any given turn. A maintained structured profile is dense signal. This is where the engineering effort should go.
Model routing: Groq for the cheap stuff, Claude for the teaching
The two-layer memory creates two distinct kinds of model calls, and they shouldn't use the same model.
The Layer 2 distillation — "read these recent turns, update this JSON profile" — is a structured extraction task. It doesn't need Claude. I route it to a fast model on Groq, where latency is low and per-token cost is a fraction of Claude's. It runs in the background after a turn, so even if it's slightly less precise, the user never feels it.
The actual tutoring response — the thing the user reads, where tone and pedagogical judgment matter — goes to Claude. This is where you don't cut corners. A learner can instantly tell when a language tutor's correction is mechanical versus genuinely helpful, and Claude's responses are noticeably warmer and more pedagogically aware for this task.
The routing logic is boring on purpose:
- User-facing tutoring turn → Claude
- Background profile update → Groq (fast model)
- Simple intent classification ("is this a question or a translation request") → Groq
This split cut my per-conversation cost meaningfully while keeping the quality where the user actually perceives it. The mistake I see other builders make is routing everything to the premium model "to be safe" — you're paying Claude prices to update a JSON blob nobody reads. Or the opposite, routing everything to a cheap model and wondering why retention is bad. Match the model to whether the output is seen or not.
What 3 paying users taught me that 100 free signups couldn't
The 100 free signups gave me vanity. The three paying users gave me a product.
Paying users tell you the truth about your bugs. A free user who hits a confusing reply just leaves silently. A paying user messages me: "It corrected me but didn't explain why." That single message led me to add explanation depth as a learner-profile preference. Free users churn quietly; paying users complain usefully.
The thing they pay for is not the feature you think. I assumed people wanted grammar correction. The dad paying me wanted continuity — he wanted EspaLuz to remember that last week he was learning restaurant vocabulary so this week could build on it. The memory architecture was the product, not a supporting feature. I only learned that because someone with skin in the game told me what made them come back.
Three engaged users surface every edge case that matters at this stage. Between them, my three paying users hit: code-switching mid-sentence (Spanish + English), voice notes (which I had to handle or lose them), late-night usage where the profile from a tired session polluted long-term state, and a case where one user shared the agent with a family member and two learners ended up in one thread. That last one broke my single-profile assumption and forced a per-sender memory keying fix. A hundred free signups produced none of these because none of them used the thing long enough to find the cracks.
The brutal reframe: signups are a measure of your marketing. Paying retention is the only measure of your product. I spent weeks proud of 100 signups for software that nobody actually wanted to use. Three people paying small amounts taught me more in two weeks than a month of free-tier dashboards.
What I'd tell another technical founder building agents
Pick the channel where the user already is, before you pick the framework. I lost a month optimizing a web app for a behavior — opening a learning portal — that simply doesn't happen at the frequency a habit product needs. WhatsApp and Telegram win for habit-formation agents because the cost of re-engagement is one notification, not a login flow.
Don't pay for a vector store until you've proven the structured-profile-plus-recent-window approach genuinely fails for your use case. For conversational agents with a clear learner/customer state, distilled structured memory beats raw embedding search on cost, latency, and debuggability. You can read a JSON profile and understand exactly why the agent said what it said. You cannot easily debug a vector similarity miss.
Route your models by visibility. Premium model for what the user reads, cheap fast model for the plumbing. And get three people to pay you before you celebrate a hundred who don't.
Frequently Asked Questions
Q: Why not use the WhatsApp Business API's own message history instead of storing turns in Postgres?
A: The WhatsApp Business API gives you inbound webhooks, not a queryable conversation history you control. You get the message when it arrives and then it's yours to keep or lose. You need your own store for the recent-turns window and the learner profile — Postgres on an Oracle Always Free VM handles a few thousand users' worth of this without breaking a sweat.
Q: How do you stop the Layer 2 profile from accumulating garbage over months of use?
A: The distillation prompt is instructed to replace and consolidate, not append. Each update rewrites fields like recurring_errors based on recent evidence, so an error the user has clearly mastered drops off. I also cap the profile size in tokens — if extraction tries to grow it past the cap, older low-confidence items get pruned. It's maintenance, not accumulation.
Q: Doesn't dropping the vector store mean you can't do semantic recall at all?
A: Correct, and that's a deliberate tradeoff for this use case. Tutoring rarely needs "find me a relevant moment from three weeks ago" — it needs "what's this learner's level and what did we cover recently", which the profile and recent window handle. If I were building a support agent over a large document corpus, I'd reach for retrieval. For a conversational habit product, structured state wins on cost and explainability.
Q: What's the actual latency hit from running the Groq distillation step?
A: Zero perceived, because it runs after the user-facing reply is sent, not before. The Claude tutoring response goes out first; the background profile update happens asynchronously and is ready for the next turn. The user never waits on it. If you put distillation in the critical path you'd feel it — don't.
Q: Three paying users is not a business. Why build infrastructure for that?
A: I didn't over-build it — the whole stack runs on free-tier Oracle compute and avoids every paid managed service I could avoid. The infrastructure decisions were about being cheap and debuggable enough to learn from three users without burning runway. Build the minimum that lets paying users teach you; scale the architecture when paying users force you to, not before.