Building a 24/7 AI Agent Platform from Scratch: Lessons from a 300K LOC System

I'm Denis Shokhirev, an enterprise AI architect based in Erlangen, Germany. At DennisCraft AI Studio, I ship AI systems to DACH B2B clients in logistics, fintech, and industrial automation, using a stack of Claude, Supabase, n8n, Doppler, and self-hosted Postgres. Shipping 14 production AI agents in six months exposed pain points that don't show up in demos: concurrency bugs, token exhaustion, and security risks in LLM-generated code. This post breaks down the architecture of my 300K LOC platform: the patterns that survive regulated European production, not slides.

Core Architecture: Stable, Isolated Agents First

Design Pattern

The goal is to isolate each AI agent behind a stable pipeline: job queues, failure tracking, tight control of external calls (Claude Code, OpenAI API), and granular monitoring. Each agent runs as a separate process, orchestrated via async queues (Supabase Realtime, Redis pub/sub). Why not microservices? For AI agents, process pools are simpler: splitting agents into separate services turns shared state and API quota management into a nightmare.

Production-Grade Stack (with Real Drawbacks)

Component	Why I Chose It	Pain Points
Claude Code / Anthropic SDK	Best price/quality for reasoning-heavy agents	Strict rate limits, occasional latency spikes
Supabase	Fast pub/sub and metadata storage	Realtime sometimes drops events, fallback needed
n8n	Pipeline orchestration, visual editing	Debugging deep chains is hard, retry bugs pop up
Doppler	Secret management, simple CI	Lacks granular audit trails for large teams
Self-hosted Postgres	GDPR compliance, data control	Bottlenecks under load; query tuning required

Data Flow: From Request to Audit Log

Request Handling Pattern

Every incoming request (API or UI) is validated via pydantic schemas, then dropped into a Supabase queue. An agent process pulls pending jobs asynchronously and runs all steps: preprocessing, LLM call (Claude/Anthropic), postprocessing, and result storage in Postgres.


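import asyncio
from supabase import create_client

async def process_task(supabase_url, supabase_key):
    supabase = create_client(supabase_url, supabase_key)
    while True:
        # create_client returns a synchronous client, so each blocking call runs
        # in a worker thread to keep the event loop responsive.
        task = await asyncio.to_thread(
            lambda: supabase.table('tasks').select('*').eq('status', 'pending').limit(1).execute()
        )
        if task.data:
            job = task.data[0]
            # run_agent_logic is the agent's own pipeline, defined elsewhere:
            # preprocessing, LLM call, postprocessing.
            result = await asyncio.to_thread(run_agent_logic, job)
            await asyncio.to_thread(
                lambda: supabase.table('tasks').update({'status': 'done', 'result': result}).eq('id', job['id']).execute()
            )
        await asyncio.sleep(1)  # poll interval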
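
With several agent processes polling the same table, a select-then-update loop can hand one job to two workers. Below is a minimal sketch of a conditional claim that avoids this. It uses sqlite3 from the standard library purely as a stand-in for Postgres so the example runs anywhere; the `processing` status, the `worker` column, and the `claim_next_task` helper are assumptions for illustration, not part of the schema described here.

```python
import sqlite3
from typing import Optional

def claim_next_task(conn: sqlite3.Connection, worker: str) -> Optional[int]:
    """Claim one pending task; return its id, or None if nothing was claimed."""
    row = conn.execute(
        "SELECT id FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # The status check in WHERE makes the claim conditional: if another worker
    # changed the row first, rowcount is 0 and this worker backs off.
    cur = conn.execute(
        "UPDATE tasks SET status = 'processing', worker = ? "
        "WHERE id = ? AND status = 'pending'",
        (worker, row[0]),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None

# In-memory demo table with hypothetical columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT, worker TEXT)")
conn.executemany("INSERT INTO tasks (status) VALUES (?)", [("pending",), ("pending",)])
first = claim_next_task(conn, "agent-a")
second = claim_next_task(conn, "agent-b")
third = claim_next_task(conn, "agent-c")  # queue is empty by now
```

In Postgres, the same idea is often expressed in one statement with SELECT ... FOR UPDATE SKIP LOCKED, which lets concurrent workers skip rows that are already being claimed.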

Audit Logging

Every LLM call is logged in a dedicated Postgres table: prompt, output, latency, user ID. For GDPR, I maintain a separate audit trail: who, when, what prompt, what output. After a fintech client incident where an LLM returned a risky output, I introduced manual review for 2% of random jobs using n8n + Notion as a review queue.
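
A rough sketch of the logging and sampling step, assuming a single audit table and a 2% random review flag. sqlite3 stands in for Postgres so the example is self-contained; the table name `llm_audit`, its columns, and the `log_llm_call` helper are hypothetical, and pushing flagged rows to a review queue is left out.

```python
import random
import sqlite3
import time

REVIEW_RATE = 0.02  # share of jobs flagged for manual review

def log_llm_call(conn, user_id, prompt, output, latency_s, rng=random):
    """Store one LLM call and flag it for review with probability REVIEW_RATE."""
    needs_review = rng.random() < REVIEW_RATE
    conn.execute(
        "INSERT INTO llm_audit (ts, user_id, prompt, output, latency_s, needs_review) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (time.time(), user_id, prompt, output, latency_s, int(needs_review)),
    )
    conn.commit()
    return needs_review

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_audit (ts REAL, user_id TEXT, prompt TEXT, "
    "output TEXT, latency_s REAL, needs_review INTEGER)"
)
rng = random.Random(0)  # seeded only to make the demo repeatable
flagged = sum(log_llm_call(conn, "u1", "prompt", "output", 0.8, rng) for _ in range(1000))
```

Sampling at write time keeps the review decision next to the audit record, so a reviewer can later query the flagged rows directly.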

Security: Never Trust LLM-Generated Code

LLM Output as a Production Risk

Most production vulnerabilities in my stack come not from inbound requests, but from LLM-generated code. On three recent agent deployments, I caught SQL-injection and unsafe shell execution patterns in Python snippets generated by Claude. For static analysis, I run semgrep, bandit, and occasionally gitleaks for secret scanning. This matches findings from the 2023 Anthropic LLM Security paper (source), which highlights prompt-injection and code-gen risks.


semgrep --config=p/python security/ --error
bandit -r ./agents/
gitleaks detect --source=./
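
As a complement to the scanners (not a replacement for them), a cheap pre-check can reject obviously unsafe snippets before they reach the queue. The sketch below uses Python's `ast` module; the list of risky calls and the `find_risky_calls` helper are my own assumptions for illustration and cover far fewer patterns than semgrep or bandit.

```python
import ast

RISKY_CALLS = {"eval", "exec", "os.system", "os.popen"}

def dotted_name(node):
    """Return 'a.b.c' for Name/Attribute chains, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None

def find_risky_calls(source):
    """Return sorted (line, description) pairs for obviously unsafe calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if name in RISKY_CALLS:
            findings.append((node.lineno, name))
        elif name and name.startswith("subprocess."):
            # Flag subprocess calls only when shell=True is passed literally.
            for kw in node.keywords:
                if (kw.arg == "shell" and isinstance(kw.value, ast.Constant)
                        and kw.value.value is True):
                    findings.append((node.lineno, name + " with shell=True"))
    return sorted(findings)

snippet = (
    "import os, subprocess\n"
    "os.system('rm -rf ' + path)\n"
    "subprocess.run(cmd, shell=True)\n"
    "print('ok')\n"
)
findings = find_risky_calls(snippet)
```

The snippet is only parsed, never executed, so the check itself is safe to run on untrusted output.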

Sandboxing: Containing LLM Output

All LLM-generated code is executed in a sandboxed container (firejail + custom Docker) with strict limits on CPU, memory, and network calls. After a 2024 prompt-injection incident (a malicious SQL DELETE in a RAG agent), I added regex-based prompt filtering and enforced runtime sandboxes. No LLM code runs with production credentials or direct DB access.
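
To illustrate the principle at process level, here is a minimal sketch that runs a snippet in a child interpreter with an empty environment, a throwaway working directory, and CPU and memory limits. It is not the firejail + Docker setup itself and does not restrict network access; `run_untrusted`, the limit values, and the `DEMO_DB_URL` variable are assumptions for illustration. It relies on the POSIX-only `resource` module.

```python
import os
import resource
import subprocess
import sys
import tempfile

def _limit(kind, value):
    """Lower a soft resource limit without exceeding the existing hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, hard))

def _apply_limits():
    _limit(resource.RLIMIT_CPU, 2)       # seconds of CPU time
    _limit(resource.RLIMIT_AS, 1 << 30)  # 1 GiB of address space

def run_untrusted(code, timeout_s=5.0):
    """Run code in a child interpreter with no inherited environment."""
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            env={},                    # parent secrets are not passed on
            cwd=workdir,               # scratch directory, removed afterwards
            capture_output=True,
            text=True,
            timeout=timeout_s,         # wall-clock limit
            preexec_fn=_apply_limits,  # applied in the child before exec
        )

# Demo: a secret in the parent environment is invisible to the child.
os.environ["DEMO_DB_URL"] = "postgres://example"
result = run_untrusted("import os; print(os.environ.get('DEMO_DB_URL'))")
```

A process-level sandbox like this is only one layer; the container boundary is what actually keeps a compromised snippet away from the host.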

Monitoring and Alerting: What Actually Works

Metrics and Alert Patterns

Metrics ship to a self-hosted Prometheus + Grafana setup: per-agent latency, error rates, queue health. For critical alerts, I push to a Telegram bot. Example: if latency exceeds 5 s or the error rate exceeds 2% over 10 minutes, a notification fires immediately.


groups:
- name: ai-agent-alerts
  rules:
  - alert: HighLatency
    expr: avg_over_time(agent_latency[5m]) > 5
    for: 2m
    annotations:
      summary: "High latency detected in AI agent"
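
For readers who want to reason about the thresholds outside Prometheus, the alert condition can be sketched as a sliding window in plain Python. This is an illustration of the rule "average latency above 5 s or error rate above 2% over 10 minutes", not how Prometheus evaluates it; the `Sample` and `AlertWindow` names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Sample:
    ts: float         # unix timestamp in seconds
    latency_s: float  # request latency in seconds
    ok: bool          # False if the agent call failed

class AlertWindow:
    def __init__(self, window_s=600.0, max_latency_s=5.0, max_error_rate=0.02):
        self.window_s = window_s
        self.max_latency_s = max_latency_s
        self.max_error_rate = max_error_rate
        self.samples = deque()

    def add(self, sample):
        """Record a sample and drop everything older than the window."""
        self.samples.append(sample)
        while self.samples and self.samples[0].ts < sample.ts - self.window_s:
            self.samples.popleft()

    def should_alert(self):
        """True if average latency or error rate exceeds its threshold."""
        if not self.samples:
            return False
        n = len(self.samples)
        avg_latency = sum(s.latency_s for s in self.samples) / n
        error_rate = sum(not s.ok for s in self.samples) / n
        return avg_latency > self.max_latency_s or error_rate > self.max_error_rate

window = AlertWindow()
for i in range(100):
    window.add(Sample(ts=float(i), latency_s=1.0, ok=True))
healthy = window.should_alert()
for i in range(100, 103):
    window.add(Sample(ts=float(i), latency_s=1.0, ok=False))
degraded = window.should_alert()  # 3 failures out of 103 samples
```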

FAQ

Why not use an off-the-shelf no-code AI platform?

The off-the-shelf platforms I've evaluated either fail GDPR requirements (third-party data processing) or can't support complex agent pipelines. Self-hosting and full control are mandatory for European production.

How do you test agent pipeline reliability?

Every pipeline step has unit tests. Once a week, I run end-to-end tests via n8n. For LLM outputs, I snapshot results and diff against golden datasets.
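
A minimal sketch of the snapshot comparison, assuming agent outputs are structured (JSON-like) and that golden cases are stored as dictionaries. The `diff_against_golden` helper and the sample fields are hypothetical; free-text outputs would need a looser comparison than an exact diff.

```python
import difflib
import json

def normalize(output):
    """Canonical JSON text so key order and spacing do not cause false diffs."""
    return json.dumps(output, indent=2, sort_keys=True, ensure_ascii=False)

def diff_against_golden(name, golden, actual):
    """Return unified-diff lines; an empty list means the snapshot matches."""
    return list(difflib.unified_diff(
        normalize(golden).splitlines(),
        normalize(actual).splitlines(),
        fromfile=f"golden/{name}",
        tofile=f"actual/{name}",
        lineterm="",
    ))

golden = {"intent": "refund", "amount_eur": 120}
same = diff_against_golden("case-001", golden, {"amount_eur": 120, "intent": "refund"})
changed = diff_against_golden("case-001", golden, {"intent": "refund", "amount_eur": 210})
```

An empty diff passes the case; a non-empty one can be attached to the test report so the changed field is visible at a glance.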

How do you handle API rate limits?

Job queues and retry logic. Supabase queues plus one process pool per LLM endpoint prevent 429 errors in production.
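
The retry part can be sketched as exponential backoff with jitter. This is a generic pattern under stated assumptions, not the exact production logic: `RateLimitError` stands in for whatever exception the LLM client raises on HTTP 429, and the delay values are placeholders.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the client's HTTP 429 exception (hypothetical)."""

def call_with_backoff(fn, max_attempts=5, base_delay_s=1.0,
                      max_delay_s=30.0, sleep=time.sleep):
    """Call fn(); on RateLimitError wait with exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the queue
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Jitter spreads retries out so workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))

# Demo: a fake endpoint that rejects the first two calls.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimitError()
    return "ok"

waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in production the default `time.sleep` (or an async equivalent) applies.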

How do you manage secrets and tokens?

Doppler for central secret storage, with role-based access. Critical keys are never logged and never leave the production server.
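
One way to back up the "never logged" rule in code is a logging filter that masks known secret values before a record is emitted. This is a sketch under assumptions: secrets are taken to arrive as environment variables, and `DEMO_API_KEY`, `RedactSecrets`, and `ListHandler` are hypothetical names used only for the demo.

```python
import logging
import os

class RedactSecrets(logging.Filter):
    """Replace known secret values with a placeholder in log messages."""
    def __init__(self, secrets):
        super().__init__()
        self.secrets = [s for s in secrets if s]  # ignore unset values

    def filter(self, record):
        message = record.getMessage()
        for secret in self.secrets:
            message = message.replace(secret, "[REDACTED]")
        # Store the already-formatted, redacted text on the record.
        record.msg, record.args = message, None
        return True

class ListHandler(logging.Handler):
    """Collects formatted log lines in memory for the demo."""
    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))

os.environ["DEMO_API_KEY"] = "sk-demo-123"  # demo value, not a real key
handler = ListHandler()
handler.addFilter(RedactSecrets([os.environ.get("DEMO_API_KEY")]))
logger = logging.getLogger("agent-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False
logger.info("calling LLM with key %s", os.environ["DEMO_API_KEY"])
```

The filter only catches values it knows about, so it complements, rather than replaces, the habit of never passing secrets to log calls in the first place.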

What about scaling up?

So far, horizontal scaling—new agent processes, separate queues, Postgres replicas—is enough. For 100+ agents, I'll move to Kubernetes or similar orchestration.

Which part of your agent stack causes the most production incidents: job queueing, LLM logic, or external service integrations? I'd genuinely like to know. I run a free 30-min stack audit for DACH founders building AI in regulated markets. DM me on LinkedIn or write to @ger_dennis_ai.