Operations playbook · AI infrastructure

AI Agent Operations Runbook

AIdeazz is not a demo. Nine autonomous agents ship customer value today (tutoring, social, jobs, CRM, orchestration); one orchestration slot is reserved for AILA. Below is how reliability, releases, and revenue-facing automation are governed—plain language first, engineering depth where it earns trust.

Author: Elena Revicheva
Updated: May 2026
Status: Live in production

For engineers: repo layout, server paths, and incident write-ups live in ORACLE_ALL_PRODUCTS_RESILIENCE.md. This page stays founder-readable and avoids hostnames, ports, and secrets.

What a business owner should take away
0
Tracked capabilities · 9 live · 1 roadmap (AILA)
0
Infrastructure layers · Oracle · AWS · static edge
0
Automated health cadence
0
Codebases on the primary compute host
14
Manual steps in milestone → social pipeline
Section 01

Products & agents

Each row is a shipping capability—what customers or partners touch—with how it runs underneath (Linux services, process supervisor, or serverless). Naming matches the internal resilience matrix without exposing infrastructure coordinates.

CRM pipeline (HubSpot) + prospecting from hiring boards & product launches
X growth automation (stream listening + engagement + alerts)
Morning briefing audio via AWS Lambda + secure CTO data bridge
Executive rhythm: Trello digests to Telegram
📌
Operational invariant: exactly one deployed checkout per GitHub repository—prevents version drift, duplicate secrets, and “which folder is live?” incidents. Pairs like CTO + creative co-founder deliberately share a codebase but run as distinct personalities/interfaces.
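The one-checkout invariant can be audited mechanically. The sketch below is illustrative only: the function names, paths, and repository URLs are hypothetical, and it assumes someone has already collected each checkout's remote URL (for example from git remote get-url origin). It groups checkouts by remote and reports any repository deployed more than once.

```python
from collections import defaultdict


def normalize_remote(url: str) -> str:
    """Treat 'repo' and 'repo.git' (any case, trailing slash) as the same remote."""
    url = url.strip().lower()
    if url.endswith(".git"):
        url = url[:-4]
    return url.rstrip("/")


def find_duplicate_checkouts(checkouts: dict[str, str]) -> dict[str, list[str]]:
    """Map each remote to its checkout paths, keeping only remotes deployed more than once.

    `checkouts` maps a local path to its remote URL (hypothetical input shape).
    """
    by_remote: dict[str, list[str]] = defaultdict(list)
    for path, remote in checkouts.items():
        by_remote[normalize_remote(remote)].append(path)
    return {remote: sorted(paths) for remote, paths in by_remote.items() if len(paths) > 1}
```

An empty result means the invariant holds. The normalization is deliberately simple: it does not reconcile SSH and HTTPS forms of the same repository, so a real audit would need to handle that case.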
# | Agent | Role (business + ops) | Runtime | Status
01 | EspaLuz WhatsApp | Channel: Spanish tutoring on WhatsApp (conversation, drills, corrections). Runs as a managed Linux service (espaluz-whatsapp) with automated health checks. | systemd | ● Live
02 | EspaLuz Telegram | Channel: Same tutoring product on Telegram. Two-layer memory: retrieval + pgvector RAG (espaluz_rag.py). Service espaluz-familybot. | systemd | ● Live
03 | EspaLuz Influencer | Brand: Instagram publishing on a disciplined schedule; can spotlight real shipping milestones in consumer-friendly copy. Groq captions + Make.com media handoff. Unit espaluz-influencer. | systemd | ● Live
04 | Algom Alpha (@reviceva) | Growth: Always-on X presence (education + narrative); folds major releases into the timeline without sounding like raw developer logs. Stream sampling, engagement runner, and account-activity hooks coordinated with the CTO bot for alerts / follow-back. PM2 workers include dragontrade-main and satellite processes. | PM2 | ● Live
05 | VibeJob Hunter | Product: Autonomous job hunt pipeline (evaluation harness, routing, ATS integrations). Shares codebase with the marketing co-founder agent. Worker vibejobhunter. | systemd | ● Live
06 | AI Marketing Co-Founder (CMO AIPA) | Revenue narrative: LinkedIn cadence, long-form syndication, CRM hygiene; turns engineering momentum into market-facing proof. Claude + connectors for social; Hunter.io enrichment → HubSpot. Paired FastAPI bridge vibejobhunter-web exposes an internal health route. | systemd | ● Live
07 | OpenClaw Vibejob | Shortlist UX: Curated job shortlists delivered inside Telegram. Standalone gateway service openclaw-gateway; probed via private health URL on the app host. | systemd | ● Live
08 | Tech Co-Founder (CTO AIPA) | Control tower: Watches repositories, scores riskier changes, broadcasts milestones to marketing, runs outreach/board workflows. Express orchestrator under PM2 (cto-aipa), Oracle Autonomous DB via wallet-based TLS; credentials never live in this HTML. | PM2 | ● Live
08.1 | Sprint Briefing (Sprinter) | Founder ritual: Daily audio briefing synthesized from tasks, notes, and captures. AWS Lambda on a schedule; pulls context through the CTO service over HTTPS with shared-secret auth, with no database wallet inside Lambda. | Lambda | ● Live
09 | Creative Co-Founder (Atuona CCF) | Creative partner: Separate bot persona + public studio site, with the same reliability envelope as the CTO stack. Single PM2 orchestrator binary; site ships via static edge hosting. | PM2 | ● Live
10 | AILA | Roadmap: Long-horizon personal orchestration; documented architecture, not yet a standalone production process. Interim coordination fields live in Oracle until AILA ships. |  | In design
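Row 08.1 says the briefing Lambda reaches the CTO service over HTTPS with shared-secret auth. A minimal sketch of the receiving side of such a check follows. It is written in Python for illustration (the CTO service itself is described as an Express orchestrator), and the header name x-briefing-secret and the function name are assumptions, not the production values.

```python
import hmac


def is_authorized(headers: dict[str, str], expected_secret: str,
                  header_name: str = "x-briefing-secret") -> bool:
    """Return True only when the request carries the expected shared secret.

    `header_name` is a hypothetical header; the real one is not documented here.
    """
    if not expected_secret:
        # Refuse to authorize anything if the secret was never configured.
        return False
    # Header names are case-insensitive in HTTP, so normalize before lookup.
    supplied = {k.lower(): v for k, v in headers.items()}.get(header_name, "")
    # compare_digest runs in constant time, which avoids leaking the secret via timing.
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())
```

The design point matches the table: the Lambda only ever holds this one secret, so the database wallet stays on the CTO host.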
Section 02

Reliability & uptime practice

For founders: scheduled probes ask each product whether it still responds. For engineers: one bash driver on the primary Oracle VM runs roughly every five minutes; keep-alive traffic avoids idle reclamation; systemd caps restart storms.

No noisy herd restarts: failed probes recycle only the affected unit. Concrete URLs and scripts remain in the private appendix—not pasted here.
Agent | Health check method | Recovery action
CTO AIPA + Atuona | Orchestrator HTTP OK via localhost probe | pm2 restart cto-aipa
EspaLuz WhatsApp | Tutoring webhook answers OK from localhost | systemctl restart espaluz-whatsapp
VibeJob Hunter + CMO | Marketing bridge health endpoint OK internally | systemctl restart vibejobhunter-web vibejobhunter
OpenClaw Shortlist | HTTP GET gateway loopback → 200 | systemctl restart openclaw-gateway
Sprint Briefing | CloudWatch + EventBridge schedule | Lambda retries / DLQ policy
PM2 stacks (e.g. Algom) | cron HTTP + pm2 jlist status online | pm2 restart <app>
All systemd agents | Process liveness via systemctl | systemd restart policy
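The table above pairs each probe with exactly one recovery action, which is what keeps restarts from spreading. The production driver is a bash script that is not reproduced here; the Python sketch below only illustrates the "recycle the affected unit, nothing else" logic. The probe callables and the run_cycle name are hypothetical, while the recovery commands are copied from the table.

```python
import subprocess
from typing import Callable

# Probe name -> recovery command, taken from the table above.
RECOVERY = {
    "cto-aipa": ["pm2", "restart", "cto-aipa"],
    "espaluz-whatsapp": ["systemctl", "restart", "espaluz-whatsapp"],
    "vibejobhunter": ["systemctl", "restart", "vibejobhunter-web", "vibejobhunter"],
    "openclaw-gateway": ["systemctl", "restart", "openclaw-gateway"],
}


def run_cycle(probes: dict[str, Callable[[], bool]], restart=subprocess.run) -> list[str]:
    """Run every probe once and restart only the units whose probe failed."""
    recycled = []
    for name, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            # A probe that raises (timeout, connection refused) counts as unhealthy.
            healthy = False
        if not healthy:
            restart(RECOVERY[name], check=False)
            recycled.append(name)
    return recycled
```

Passing restart as a parameter lets the cycle be dry-run without touching real services; a cron entry firing roughly every five minutes would call run_cycle with real HTTP probes.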
Section 03

Go-to-market automation

When engineering ships something worth talking about, the stack fans it out across LinkedIn, blogs, X, and Instagram—without a human retyping the same story five times.

🔀
Quality gate: only commits tagged feat:, launch:, or release: notify the marketing agent. Housekeeping commits (fix:, docs:, chore:, …) stay invisible to customers.
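The quality gate reduces to a prefix check on the commit subject line. A minimal sketch, assuming the tags are matched literally at the start of the first line and case-insensitively (both are assumptions; the function name is hypothetical):

```python
MILESTONE_PREFIXES = ("feat:", "launch:", "release:")


def notifies_marketing(commit_message: str) -> bool:
    """True when the commit subject starts with a customer-facing tag."""
    stripped = commit_message.strip()
    # Only the subject (first line) counts; a tag mentioned in the body does not.
    subject = stripped.splitlines()[0] if stripped else ""
    return subject.lower().startswith(MILESTONE_PREFIXES)
```

Under this literal reading, a scoped subject such as feat(api): would not pass the gate; whether the production classifier accepts scopes is not stated in this page.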
1
GitHub Webhook

Commit detected → CTO AIPA

Push events hit the secured webhook. Groq/Claude review diffs, classify milestones, enqueue pending updates for downstream marketers.

2
LinkedIn · 20:00 Panama

CMO generates + posts

Claude Sonnet copy → Make.com delivery. Zero manual paste.

3
Hashnode + dev.to · Async

Blog crosspost

blog_publisher.py fires after LinkedIn: Hashnode essay + dev.to canonical backlink to aideazz.xyz.

4
X · Every 5th post slot

Algom Alpha tweet

x-tech-updater.js merges milestones in plain language (Haiku / Groq), guarded against duplicate queue states.

5
Instagram · Even days 18:00 Panama

EspaLuz Influencer

Milestone-aware caption + Make.com media pipeline; falls back to standard queue when nothing pending.
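Step 1 relies on a secured webhook. This page does not say how it is secured; one common approach, sketched here as an assumption, is to verify the X-Hub-Signature-256 header that GitHub attaches when a webhook secret is configured. The header value is "sha256=" followed by the hex HMAC-SHA256 of the raw request body. The function name is hypothetical.

```python
import hashlib
import hmac


def valid_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a raw webhook body against the X-Hub-Signature-256 header value."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison; a missing header is treated as an empty string.
    return hmac.compare_digest(expected.encode(), (signature_header or "").encode())
```

The check must run on the exact bytes received, before any JSON parsing, because re-serializing the payload can change whitespace and invalidate the signature.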

Section 04

Release discipline

Board-friendly translation: we ship like a product company—predictable processes, isolated secrets, verifiable rollouts—even though agents move faster than most teams.

Rule · 01

One live checkout per codebase

Eliminates “which folder is prod?” debates; paired bots share code intentionally but never duplicate repos.

Rule · 02

Green build, then swap

Pull latest → compile/tests succeed → only then restart supervised processes. Broken artifacts never replace what customers already rely on.
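The ordering in this rule can be expressed as a short gate: run each build step, stop at the first failure, and restart only when everything passed. The sketch below assumes the steps are plain shell commands; the deploy name and the example commands in the usage note are illustrative, not the production scripts.

```python
import subprocess
from typing import Callable, Sequence


def deploy(steps: Sequence[Sequence[str]], restart: Sequence[str],
           run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode) -> bool:
    """Run build steps in order; issue the restart only if every step exits 0."""
    for cmd in steps:
        if run(cmd) != 0:
            # A failed pull, compile, or test leaves the running process untouched.
            return False
    return run(restart) == 0
```

A call might look like deploy([["git", "pull", "--ff-only"], ["npm", "test"]], ["pm2", "restart", "cto-aipa"]), where the build commands are hypothetical and the restart command mirrors the recovery action used elsewhere in this runbook.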

Rule · 03

Secrets isolation

Each bot owns its environment file; crypto wallets never touch GitHub; TypeScript strict mode catches sloppy typings before prod.

Rule · 04

PM2 persistence

pm2 startup + pm2 save on every new process; ecosystem files set max_restarts + autorestart.

Rule · 05

No silent failures

Crash handlers log before exit so supervisors show why something died; watchdog cadence targets ~5 minute detection.
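For the Python services, "log before exit" can be approximated with an uncaught-exception hook. This is a sketch under assumptions: the function name is hypothetical, and the Node processes under PM2 would need their own equivalent, which is not shown.

```python
import logging
import sys


def install_crash_logger(logger: logging.Logger) -> None:
    """Log any uncaught exception with its traceback before the process exits."""
    def hook(exc_type, exc, tb):
        logger.critical("uncaught exception, exiting", exc_info=(exc_type, exc, tb))
        # Still print the default traceback so supervisor logs show it too.
        sys.__excepthook__(exc_type, exc, tb)
    sys.excepthook = hook
```

After the hook returns, the interpreter exits with a non-zero status as usual, so the supervisor's restart policy still applies; the difference is that the reason is already in the log.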

Rule · 06

Verify after deploy

Health signal green, database connectivity logs clean, and one real Telegram interaction: all three must pass before the deploy or incident is closed.

Section 05

Incident response template

For stakeholders: regressions are handled like financial reconciliations—symptoms, compounded causes, fix, proof—so the same automation trap rarely strikes twice.

🔁

HubSpot duplicate posting loop

May 10, 2026 — same milestone tweet emitted twice ~6 minutes apart

Symptom: Pending HubSpot milestones resurfaced every x-tech-updater.js cycle.
Root causes: Three compounding mismatches. Legacy rows were flagged posted while the pending filter checked posted_x; the mark endpoint matched on timestamp while older rows stored received_at; and the existing backlog had never been backfilled with posted_x.
Fix applied: The pending GET now excludes rows carrying either flag; the mark endpoint matches on timestamp, then received_at, then title; the JS client sends title so the last fallback can match.
Verified by: API snapshot {"ok": true, "pending": [], "total": 0, "held": true} + two full automation cycles without duplication.
Resolution: ≈2 hours from detection → patched APIs → verified on live automation cycles. Full narrative retained in the engineering appendix linked below.
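The shape of the fix can be sketched in a few lines. The field names (posted, posted_x, timestamp, received_at, title) come from the write-up above; the row layout, function names, and in-memory list are assumptions for illustration, since the real endpoints sit in front of a database.

```python
from typing import Optional


def pending(rows: list[dict]) -> list[dict]:
    """Rows not yet posted under either the legacy flag or the newer one."""
    return [r for r in rows if not r.get("posted") and not r.get("posted_x")]


def mark_posted(rows: list[dict], timestamp: Optional[str] = None,
                title: Optional[str] = None) -> Optional[dict]:
    """Mark the first row matching by timestamp, then received_at, then title."""
    candidates = [
        ("timestamp", timestamp),
        ("received_at", timestamp),  # older rows stored the same value under this key
        ("title", title),            # last-resort fallback sent by the client
    ]
    for field, value in candidates:
        if value is None:
            continue
        for row in rows:
            if row.get(field) == value:
                row["posted_x"] = True
                return row
    return None
```

Returning None when nothing matches gives the caller a signal to log, which is the difference between a missed mark and a silent loop.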
📝
SOP update rule: materially production-facing incidents earn the same structured write-up internally—so institutional memory compounds instead of resetting.
Section 06

Stack reference

Boring reliability primitives where uptime matters; sharp AI + CRM + social APIs where differentiation matters.

Process Mgmt: PM2 · systemd · AWS Lambda + EventBridge
Databases: Oracle Autonomous DB (walleted TLS, thick mode, multi-table estate) · PostgreSQL + pgvector (1536-d RAG)
CRM & Outreach: HubSpot CRM v3 + v4 associations · Hunter.io · Resend
Social: X API v2 (Account Activity, filtered stream, engagement worker) · Make.com · Telegram Bot API
AI / LLMs: Claude Sonnet / Haiku · Groq Llama 3.3 70B · OpenAI TTS / Whisper · LangChain · LangGraph
Lead Gen: HN Algolia · GitHub REST · Product Hunt GraphQL (~150–250 net-new companies/month after filtering)
Hosting / CDN: OCI Ubuntu VM (VM.Standard.E5.Flex, 12 GB) · AWS Lambda · 4everland IPFS frontends · Cloudflare DNS
Monitoring: Cron health driver · PM2 logs · CloudWatch · curl probes · OCI keep-alive
Content & SEO: Hashnode GraphQL · dev.to · GA4 · Google Search Console · GEO pack (llms.txt, crawler tokens)
Project Mgmt: Trello API (daily + weekly Telegram briefings) · GitHub webhooks across the fleet