Operations playbook · AI infrastructure


AI Agent Operations Runbook

AIdeazz is not a demo. Ten autonomous agents ship customer value today (tutoring, social, jobs, CRM, orchestration, marketing intelligence); one orchestration slot is reserved for AILA. Below is how reliability, releases, and revenue-facing automation are governed—plain language first, engineering depth where it earns trust.

Author
Elena Revicheva
Updated
June 2026
Status
Live in production

For engineers: authoritative product details—canonical agent list, GitHub repos, Oracle VM deploy paths, systemd/PM2 names, health checks, local Windows clones, and resilience postmortems—are maintained in ORACLE_ALL_PRODUCTS_RESILIENCE.md (engineering appendix). This public page stays founder-readable: no IPs, ports, or secrets.

What a business owner should take away
0
Tracked capabilities · 10 live · 1 roadmap (AILA)
0
Infrastructure layers · Oracle · AWS · static edge
0
Automated health cadence
0
Repos powering live products · all on one server
14
Manual steps in milestone → social pipeline
Section 01

Products & agents

Each row is a shipping capability—what customers or partners touch—with how it runs underneath (Linux services, process supervisor, or serverless). Naming matches the internal resilience matrix without exposing infrastructure coordinates.

GEO + SEO infrastructure · aideazz.xyz · AI-crawler signals · bilingual blog · lead capture → Oracle CRM pipeline (HubSpot) + prospecting from hiring boards & product launches
X growth automation (stream listening + engagement + alerts)
Morning briefing audio via AWS Lambda + secure CTO data bridge
Executive rhythm: Trello digests to Telegram
Live web data layer: Bright Data Web Unlocker · SERP · Scraping Browser + autonomous Claude research agent (/research_company · /research_employer · /research_competitor in Telegram)
Multi-provider LLM failover, fleet-wide: every agent, including all three EspaLuz bots, falls back from Claude to Groq (free Llama) at minimum, with Grok (xAI), OpenAI, and Gemini as extra tiers in higher-volume products (Atlas Shifted: Claude → Groq → OpenAI → Grok). The content engine survives any single provider outage.
Operational invariant: exactly one deployed checkout per GitHub repository—prevents version drift, duplicate secrets, and “which folder is live?” incidents. Pairs like CTO + creative co-founder deliberately share a codebase but run as distinct personalities/interfaces.
# | Agent | Role (business + ops) | Runtime | Status
01 | EspaLuz WhatsApp | Channel: Spanish tutoring on WhatsApp (conversation, drills, corrections). Runs as a managed Linux service (espaluz-whatsapp) with automated health checks. | systemd | ● Live
02 | EspaLuz Telegram | Channel: Same tutoring product on Telegram. Two-layer memory: retrieval + pgvector RAG (espaluz_rag.py). Service espaluz-familybot. | systemd | ● Live
03 | EspaLuz Influencer | Brand: Instagram publishing on a disciplined schedule; can spotlight real shipping milestones in consumer-friendly copy. Groq captions + Make.com media handoff. Unit espaluz-influencer. | systemd | ● Live
04 | Algom Alpha (@reviceva) | Growth: Always-on X presence (education + narrative); folds major releases into the timeline without sounding like raw developer logs. Stream sampling, engagement runner, and account-activity hooks coordinated with the CTO bot for alerts / follow-back. PM2 workers include dragontrade-main and satellite processes. | PM2 | ● Live
05 | VibeJob Hunter | Product: Autonomous job hunt pipeline (evaluation harness, routing, ATS integrations). Shares codebase with the marketing co-founder agent. Worker vibejobhunter. | systemd | ● Live
06 | AI Marketing Co-Founder (CMO AIPA) | Revenue narrative: LinkedIn cadence, long-form syndication, CRM hygiene; turns engineering momentum into market-facing proof. Claude + connectors for social; Hunter.io enrichment → HubSpot. Paired FastAPI bridge vibejobhunter-web exposes an internal health route. | systemd | ● Live
07 | OpenClaw Vibejob | Shortlist UX: Curated job shortlists delivered inside Telegram. Standalone gateway service openclaw-gateway; probed via private health URL on the app host. | systemd | ● Live
08 | Tech Co-Founder (CTO AIPA) | Control tower: Watches repositories, scores riskier changes, broadcasts milestones to marketing, runs outreach/board workflows. Express orchestrator under PM2 (cto-aipa), Oracle Autonomous DB via wallet-based TLS; credentials never live in this HTML. | PM2 | ● Live
08.1 | Sprint Briefing (Sprinter) | Founder ritual: Daily audio briefing synthesized from tasks, notes, and captures. AWS Lambda on a schedule; pulls context through the CTO service over HTTPS with shared-secret auth, with no database wallet inside Lambda. | Lambda | ● Live
09 | Creative Co-Founder (Atuona CCF) | Creative partner: Separate bot persona + public studio site, with the same reliability envelope as the CTO stack. Single PM2 orchestrator binary; site ships via static edge hosting. | PM2 | ● Live
10 | Atlas Shifted (Marketing Strategist) | Marketing intelligence: Watches the public ad market daily, detects which creative angle is opening before it saturates (ENTER / WATCH / AVOID), then generates the evidence-grounded creative (image + video via the Atuona pipeline). Feeds the marketing engine (dogfooded on the EspaLuz expat vertical). Capture-first immutable JSONL log + OpenAI-embedding angle classifier + SQLite projection; off-VM daily backup. PM2 (whitespace:8095) + Panama-time cron. Public ad-transparency data only; no spend/CTR claimed. | PM2 | ● Live
11 | AILA | Roadmap: Long-horizon personal orchestration (documented architecture, not yet a standalone production process). Interim coordination fields live in Oracle until AILA ships. | | In design
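The Atlas Shifted row mentions a capture-first immutable JSONL log with a SQLite projection. A minimal sketch of that pattern follows, assuming an append-only file as the source of truth and a table that can be rebuilt from it at any time. The function names, the `captures` table, and the `ad_id` / `angle` / `captured_on` fields are hypothetical, and the sample records are invented for illustration.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def append_capture(log_path: Path, record: dict) -> None:
    """Append one capture as a JSON line; existing lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def rebuild_projection(log_path: Path, conn: sqlite3.Connection) -> int:
    """Drop and rebuild the queryable table from the log; returns rows loaded."""
    conn.execute("DROP TABLE IF EXISTS captures")
    conn.execute("CREATE TABLE captures (ad_id TEXT, angle TEXT, captured_on TEXT)")
    loaded = 0
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            rec = json.loads(line)
            conn.execute(
                "INSERT INTO captures VALUES (?, ?, ?)",
                (rec.get("ad_id"), rec.get("angle"), rec.get("captured_on")),
            )
            loaded += 1
    conn.commit()
    return loaded


def demo() -> list:
    """Write two sample captures, rebuild the projection, and count per angle."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "captures.jsonl"
        append_capture(log_path, {"ad_id": "a1", "angle": "relocation", "captured_on": "2026-06-01"})
        append_capture(log_path, {"ad_id": "a2", "angle": "relocation", "captured_on": "2026-06-02"})
        conn = sqlite3.connect(":memory:")
        rebuild_projection(log_path, conn)
        rows = conn.execute("SELECT angle, COUNT(*) FROM captures GROUP BY angle").fetchall()
        conn.close()
        return rows
```

Because the table is derived, a schema change or a corrupted database only costs a rebuild, while the raw log stays the single thing that needs backing up.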
Section 02

Reliability & uptime practice

For founders: scheduled probes ask each product whether it still responds. For engineers: one bash driver on the primary Oracle VM runs roughly every five minutes; keep-alive traffic avoids idle reclamation; systemd caps restart storms.

No noisy herd restarts: failed probes recycle only the affected unit. Concrete URLs and scripts remain in the private appendix—not pasted here.
Agent | Health check method | Recovery action
CTO AIPA + Atuona | Orchestrator HTTP OK via localhost probe | pm2 restart cto-aipa
EspaLuz WhatsApp | Tutoring webhook answers OK from localhost | systemctl restart espaluz-whatsapp
VibeJob Hunter + CMO | Marketing bridge health endpoint OK internally | systemctl restart vibejobhunter-web vibejobhunter
OpenClaw Shortlist | HTTP GET gateway loopback → 200 | systemctl restart openclaw-gateway
Sprint Briefing | CloudWatch + EventBridge schedule | Lambda retries / DLQ policy
PM2 stacks (e.g. Algom) | cron HTTP + pm2 jlist status online | pm2 restart <app>
All systemd agents | Process liveness via systemctl | systemd restart policy
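The probe-then-recycle behavior in this section can be sketched as a table of probes, each paired with its own recovery command, so that a failure touches only the affected unit. The production driver is described as a bash script; the Python below is only an illustration, and `Probe` and `run_probes` are hypothetical names.

```python
import subprocess
from typing import Callable, List, NamedTuple


class Probe(NamedTuple):
    unit: str                      # label for logs
    healthy: Callable[[], bool]    # returns True when the product responds
    recovery: List[str]            # command that recycles only this unit


def run_probes(probes: List[Probe], run: Callable = subprocess.run) -> List[str]:
    """Run every probe once; recycle only the units whose probe failed."""
    recycled = []
    for probe in probes:
        if probe.healthy():
            continue
        run(probe.recovery, check=False)  # a failed restart must not abort the loop
        recycled.append(probe.unit)
    return recycled


if __name__ == "__main__":
    issued = []
    probes = [
        Probe("espaluz-whatsapp", lambda: True, ["systemctl", "restart", "espaluz-whatsapp"]),
        Probe("openclaw-gateway", lambda: False, ["systemctl", "restart", "openclaw-gateway"]),
    ]
    # Dry run: record the commands instead of executing them.
    print(run_probes(probes, run=lambda cmd, check: issued.append(cmd)), issued)
```

Passing the command runner as a parameter keeps the loop testable without touching real services.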
Section 03

Go-to-market automation

When engineering ships something worth talking about, the stack fans it out across LinkedIn, blogs, X, and Instagram—without a human retyping the same story five times.

Quality gate: only commits tagged feat:, launch:, or release: notify the marketing agent. Housekeeping commits (fix:, docs:, chore:, …) stay invisible to customers.
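The quality gate amounts to a prefix check on the commit subject line. A minimal sketch, assuming the prefix is matched case-insensitively at the start of the first line and that scoped forms such as `feat(api):` are not treated as milestones (both are assumptions, and `is_milestone` is a hypothetical name):

```python
MILESTONE_PREFIXES = ("feat:", "launch:", "release:")


def is_milestone(commit_message: str) -> bool:
    """True when the first line of the commit message starts with a milestone tag."""
    lines = commit_message.strip().splitlines()
    subject = lines[0] if lines else ""
    return subject.lower().startswith(MILESTONE_PREFIXES)


if __name__ == "__main__":
    for message in ("feat: add RAG memory", "fix: typo", "docs: mention feat: prefix"):
        print(message, "->", is_milestone(message))
```

Checking only the subject line keeps a housekeeping commit that merely mentions `feat:` in its text from reaching the marketing agent.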
1
GitHub Webhook

Commit detected → CTO AIPA

Push events hit the secured webhook. Groq/Claude review diffs, classify milestones, enqueue pending updates for downstream marketers.

2
LinkedIn · 20:00 Panama

CMO generates + posts

Claude Sonnet copy → Make.com delivery. Zero manual paste.

3
Daily blog · aideazz.xyz/blog + dev.to

Bilingual blog publish (EN/ES)

The daily blog publisher ships articles to aideazz.xyz/blog with dev.to crosspost (“Also on Dev.to”). Sliding-window mutex, title dedup, and always-notify Telegram guard against silent double publishes.
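One way to read the sliding-window mutex and title dedup is as a guard that refuses a publish when the same normalized title has already gone out, or when another publish happened inside a recent time window. The sketch below is an in-memory illustration under that interpretation; `PublishGuard` and its methods are hypothetical, and a production guard would need to persist its state across restarts.

```python
import time
from typing import Optional


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so near-identical titles compare equal."""
    return " ".join(title.lower().split())


class PublishGuard:
    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._last_publish_at: Optional[float] = None
        self._seen_titles: set[str] = set()

    def try_acquire(self, title: str, now: Optional[float] = None) -> tuple[bool, str]:
        """Return (allowed, reason); the reason can be forwarded to a notification."""
        now = time.time() if now is None else now
        key = normalize_title(title)
        if key in self._seen_titles:
            return False, "duplicate title"
        if self._last_publish_at is not None and now - self._last_publish_at < self.window_seconds:
            return False, "inside publish window"
        self._seen_titles.add(key)
        self._last_publish_at = now
        return True, "ok"
```

Returning a reason on every call, including refusals, matches the always-notify idea: a skipped publish is reported rather than silently dropped.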

4
X · Every 5th post slot

Algom Alpha tweet

x-tech-updater.js merges milestones in plain language (Haiku / Groq), guarded against duplicate queue states.

5
Instagram · Even days 18:00 Panama

EspaLuz Influencer

Milestone-aware caption + Make.com media pipeline; falls back to standard queue when nothing pending.

Section 04

Release discipline

Board-friendly translation: we ship like a product company—predictable processes, isolated secrets, verifiable rollouts—even though agents move faster than most teams.

Rule · 01

One live checkout per codebase

Eliminates “which folder is prod?” debates; paired bots share code intentionally but never duplicate repos.

Rule · 02

Green build, then swap

Pull latest → compile/tests succeed → only then restart supervised processes. Broken artifacts never replace what customers already rely on.
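Rule 02 can be sketched as a sequence in which the restart command is reached only if every earlier step exits cleanly. The function name and the example build commands are assumptions for illustration; only `pm2 restart cto-aipa` comes from this page.

```python
import subprocess
from typing import Callable, List

# Example steps only: the real build and test commands differ per repository.
EXAMPLE_BUILD_STEPS = [["git", "pull", "--ff-only"], ["npm", "run", "build"], ["npm", "test"]]
EXAMPLE_RESTART_STEP = ["pm2", "restart", "cto-aipa"]


def green_build_then_swap(
    build_steps: List[List[str]],
    restart_step: List[str],
    run: Callable = subprocess.run,
) -> bool:
    """Run build steps in order; restart the supervised process only if all succeed."""
    for step in build_steps:
        if run(step).returncode != 0:
            return False  # stop here and leave the running process untouched
    return run(restart_step).returncode == 0
```

A call such as `green_build_then_swap(EXAMPLE_BUILD_STEPS, EXAMPLE_RESTART_STEP)` would run from the checkout directory; a failing test leaves the previous process serving customers.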

Rule · 03

Secrets isolation

Each bot owns its environment file; crypto wallets never touch GitHub; TypeScript strict mode catches sloppy typings before prod.

Rule · 04

Crash-proof process registry

Every agent process is registered to auto-start on server boot and auto-restart on failure—no manual babysitting after a power cycle or kernel update.

Rule · 05

No silent failures

Crash handlers log before exit so supervisors show why something died; watchdog cadence targets ~5 minute detection.
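For a Python-based agent, logging before exit can be done by replacing the interpreter's uncaught-exception hook. This is a sketch of the idea rather than the fleet's handler; the logger name `agent` and both function names are hypothetical.

```python
import logging
import sys

logger = logging.getLogger("agent")


def log_uncaught(exc_type, exc_value, exc_tb) -> None:
    """Record the fatal exception with its traceback before the process exits."""
    logger.critical(
        "fatal: uncaught exception, exiting",
        exc_info=(exc_type, exc_value, exc_tb),
    )


def install_crash_handler() -> None:
    """Route uncaught exceptions through the logger so the supervisor's logs show the cause."""
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    sys.excepthook = log_uncaught
```

After the hook returns, the interpreter still exits with a non-zero status, so the process supervisor's restart policy applies as usual.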

Rule · 06

Verify after deploy

Health signal green, database connectivity logs clean, one real Telegram interaction—all pass before the incident is closed.

Section 05

Incident response template

For stakeholders: regressions are handled like financial reconciliations—symptoms, compounded causes, fix, proof—so the same automation trap rarely strikes twice.

HubSpot duplicate posting loop

May 10, 2026 — same milestone tweet emitted twice ~6 minutes apart

Symptom
Pending HubSpot milestones resurfaced every x-tech-updater.js cycle.
Root causes
Triple mismatch: legacy rows were flagged posted while the filter checked posted_x; the mark endpoint keyed on timestamp while older rows used received_at; and the backlog needed a posted_x backfill.
Fix applied
The GET endpoint excludes rows carrying either flag; the mark endpoint tries timestamp → received_at → title; the JS client sends the title for fallback matching.
Verified by
API snapshot {"ok": true, "pending": [], "total": 0, "held": true} + two full automation cycles without duplication.
Resolution
≈2 hours from detection → patched APIs → verified on live automation cycles. Full narrative retained in the engineering appendix linked below.
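The fix in this incident can be sketched in two small functions: a pending filter that skips rows carrying either flag, and a mark step that tries the timestamp column, then received_at, then title. This is one interpretation of the write-up, using plain dictionaries in place of database rows; the function names and sample values are illustrative only.

```python
from typing import List, Optional


def pending(rows: List[dict]) -> List[dict]:
    """Rows still waiting to be posted: neither the legacy nor the new flag is set."""
    return [row for row in rows if not (row.get("posted") or row.get("posted_x"))]


def find_row(rows: List[dict], timestamp: Optional[str], title: Optional[str]) -> Optional[dict]:
    """Match on timestamp first, then received_at, then title."""
    attempts = (("timestamp", timestamp), ("received_at", timestamp), ("title", title))
    for column, wanted in attempts:
        if wanted is None:
            continue
        for row in rows:
            if row.get(column) == wanted:
                return row
    return None


def mark_posted(rows: List[dict], timestamp: Optional[str], title: Optional[str]) -> bool:
    """Set posted_x on the matching row; False means nothing matched."""
    row = find_row(rows, timestamp, title)
    if row is None:
        return False
    row["posted_x"] = True
    return True
```

Returning False when nothing matches gives the caller a signal to alert on, instead of letting an unmarked row resurface on the next cycle.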

The engagement loop that never ran

May 25, 2026 — config said “32 engagements/day”; logs said zero cycles had ever completed

Symptom
Asked the logs to prove a claimed engagement rate. Startup banner found 4,357 times; cycle-completed action line found zero times. The behavior never happened, no matter what the config said.
Root causes
Three layers deep: the first engagement run was scheduled 5 minutes after startup; the process was being restarted every 5 minutes by an external cron; and that cron was a health check whose grep never matched PM2’s box-drawing table output—so it judged a healthy process dead, forever.
Fix applied
Health check rewritten to read structured state: pm2 jlist | jq on the process status field instead of grepping rendered text. Process stayed up; the first engagement cycle in the bot’s history fired the same day, with real replies and follows verified from logs.
Rule earned
Verify from logs, not config. Never claim agent behavior without grepping for the ACTION line (not the setup line). The fix is now a standing operating rule across the fleet.
SOP update rule: materially production-facing incidents earn the same structured write-up internally—so institutional memory compounds instead of resetting.
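The structured-state health check from the second incident can be sketched by parsing the JSON that `pm2 jlist` prints and reading each process's `pm2_env.status` field, instead of grepping the rendered table. The function name and the sample payload are illustrative; the app names are the ones mentioned on this page.

```python
import json


def pm2_app_online(jlist_output: str, app_name: str) -> bool:
    """True when at least one process has this name and all of them report 'online'."""
    processes = json.loads(jlist_output)
    statuses = [
        proc.get("pm2_env", {}).get("status")
        for proc in processes
        if proc.get("name") == app_name
    ]
    return bool(statuses) and all(status == "online" for status in statuses)


# Trimmed stand-in for real `pm2 jlist` output, which carries many more fields.
SAMPLE = json.dumps([
    {"name": "cto-aipa", "pm2_env": {"status": "online"}},
    {"name": "dragontrade-main", "pm2_env": {"status": "errored"}},
])

if __name__ == "__main__":
    print(pm2_app_online(SAMPLE, "cto-aipa"), pm2_app_online(SAMPLE, "dragontrade-main"))
```

In a live check, the input would come from something like `subprocess.run(["pm2", "jlist"], capture_output=True, text=True).stdout`. A missing app name counts as unhealthy, so a typo in the check cannot pass silently.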
Section 06

Stack reference

Boring reliability primitives where uptime matters; sharp AI + CRM + social APIs where differentiation matters.

Process Mgmt
PM2 · systemd · AWS Lambda + EventBridge
Databases
Oracle Autonomous DB (wallet-based TLS connection, multi-table estate) · PostgreSQL + pgvector (semantic memory for tutoring agents)
CRM & Outreach
HubSpot CRM v3 + v4 associations · Hunter.io · Resend
Social
X API v2 (Account Activity, filtered stream, engagement worker) · Make.com · Telegram Bot API
AI / LLMs
Claude Sonnet / Haiku · Groq Llama 3.3 70B · Grok (xAI, tier-3 failover) · OpenAI (gpt-4o-mini + text-embedding-3-small + TTS / Whisper) with retry + provider-fallback chain · Bright Data (public ad-transparency capture) · Runway (Seedance 2.0 / Kling 3.0) + Flux 1.1 Pro for Atlas creative · LangChain · LangGraph
Lead Gen
HN Algolia · GitHub REST · Product Hunt GraphQL · Bright Data (SERP API + Web Unlocker + Scraping Browser — replaced paid SerpAPI) — ~150–250 net-new companies/month after filtering
Hosting / CDN
OCI Ubuntu VM (VM.Standard.E5.Flex, 12 GB) · AWS Lambda · 4everland IPFS frontends · Cloudflare DNS
Monitoring
Cron health driver · PM2 logs · CloudWatch · curl probes · OCI keep-alive
Content & SEO
Bilingual daily blog (EN/ES, aideazz.xyz/blog) · dev.to crosspost · GA4 · Google Search Console · GEO pack (llms.txt, crawler tokens) · FAQPage JSON-LD (AEO)
Project Mgmt
Trello API (daily + weekly Telegram briefings) · GitHub webhooks across the fleet