How to Reduce Failures and Improve Reliability of AI Agents in Production: Control Plane for 25+ LLMs with Hallucination Detection

I’m Denis Shokhirev, an enterprise AI architect based in Erlangen, Germany. At DennisCraft AI Studio I ship production AI systems for DACH B2B clients (logistics, fintech, industrial automation) using a stack built around Claude, Supabase, n8n, Doppler, and self-hosted Postgres. The most common pain after launch is unpredictable agent failures and invisible LLM hallucinations that disrupt real-world business flows. This is not theory: these are patterns I see every week in deployed systems.

Production pains: Why LLM agents fail at scale

When you orchestrate 25+ LLMs across multiple B2B workflows, failures cluster around a few repeat offenders:

LLMs return inconsistent answers on near-identical queries
Integration crashes: timeouts, nulls, malformed JSON
Hallucinated content—fictional products, invalid SQL, security risks
Brittle n8n pipelines that break on unexpected LLM output

In three recent agent rollouts, I logged 10–15% of flows breaking due to malformed or hallucinated LLM responses. These were model artifacts, not infra bugs. The longer the pipeline, the harder it is to trace such a failure back to the LLM call that caused it.

Control plane architecture: How I keep agents stable

1. Centralized LLM call logging with Supabase

Every LLM call (prompt, params, response, status, latency) gets logged in a dedicated Postgres table via Supabase. This powers dashboards for error rates, outlier detection, and latency spikes.


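import os
from datetime import datetime, timezone

from supabase import create_client

# Assumes SUPABASE_URL and SUPABASE_KEY are set in the environment (e.g. via Doppler)
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

def log_llm_call(user_id, prompt, response, status, params=None, latency_ms=None):
    """Insert one LLM call record into the llm_logs table."""
    data = {
        "user_id": user_id,
        "prompt": prompt,
        "params": params,          # column name is illustrative
        "response": response,
        "status": status,
        "latency_ms": latency_ms,  # column name is illustrative
        # datetime objects are not JSON-serializable; send an ISO 8601 string
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    supabase.table("llm_logs").insert(data).execute()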

In a fintech automation deployment, this surfaced a 7% unstable response rate in the first week—root cause was only visible by mining these logs.
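
The error-rate and latency views come down to a few aggregations over the logged rows. Below is a minimal sketch of that idea in plain Python, not the actual dashboard queries. It assumes each row is a dict, and the `model` and `latency_ms` keys, the `"ok"` status value, and the 3x-median spike threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def summarize_logs(rows, ok_status="ok", spike_factor=3.0):
    """Aggregate error rate and latency spikes per model.

    rows: iterable of dicts with hypothetical keys "model", "status", "latency_ms".
    A call counts as a latency spike when it exceeds spike_factor x the model's median.
    """
    by_model = defaultdict(list)
    for row in rows:
        by_model[row["model"]].append(row)

    summary = {}
    for model, calls in by_model.items():
        errors = sum(1 for c in calls if c["status"] != ok_status)
        latencies = [c["latency_ms"] for c in calls if c.get("latency_ms") is not None]
        med = median(latencies) if latencies else None
        spikes = [value for value in latencies if med and value > spike_factor * med]
        summary[model] = {
            "calls": len(calls),
            "error_rate": errors / len(calls),
            "median_latency_ms": med,
            "latency_spikes": len(spikes),
        }
    return summary


# Made-up sample rows, only to show the shape of the input
sample = [
    {"model": "model-a", "status": "ok", "latency_ms": 100},
    {"model": "model-a", "status": "ok", "latency_ms": 120},
    {"model": "model-a", "status": "error", "latency_ms": 900},
    {"model": "model-b", "status": "ok", "latency_ms": 200},
]
report = summarize_logs(sample)
```

In practice the same aggregation would more likely run as a SQL view on the Postgres side. The Python version only shows which numbers the logs need to support.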

2. Real-time hallucination detection

LLMs will generate plausible but unsafe output: non-existent SKUs, invalid code, or dangerous commands. I run runtime validation on every agent output:

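SQL parsing (with sqlparse) to screen generated statements before DB execution; sqlparse is a non-validating parser, so strict syntax errors are still left for the database to reject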
Semantic checks against whitelists—e.g., known product IDs
Sanity checks for numerical and boolean fields

For database-driven flows, a pattern like this serves as a first-pass gate before any generated query reaches the database:


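import sqlparse

ALLOWED_STATEMENT_TYPES = {"SELECT"}  # read-only by default; extend per flow

def is_valid_sql(query):
    # sqlparse is non-validating: parse() accepts almost any text and rarely
    # raises, so check statement count and type instead of relying on exceptions.
    statements = [s for s in sqlparse.parse(query) if str(s).strip()]
    if len(statements) != 1:
        return False  # empty output or a multi-statement payload
    # get_type() returns "UNKNOWN" for text that is not a recognizable statement
    return statements[0].get_type() in ALLOWED_STATEMENT_TYPES

def check_output(output):
    if not is_valid_sql(output):
        return False
    # Add semantic validation here (e.g. allowed tables, known product IDs)
    return True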
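
The whitelist and sanity checks from the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production validator: the field names (`product_id`, `quantity`, `express`), the `KNOWN_PRODUCT_IDS` set, and the quantity bound are all hypothetical.

```python
KNOWN_PRODUCT_IDS = {"SKU-1001", "SKU-1002", "SKU-2001"}  # hypothetical whitelist


def validate_order_fields(payload, known_ids=KNOWN_PRODUCT_IDS, max_quantity=10_000):
    """Return a list of problems found in an agent's structured output."""
    problems = []

    # Semantic check: the product must exist in the whitelist
    if payload.get("product_id") not in known_ids:
        problems.append("unknown product_id")

    # Sanity check: quantity must be a positive integer within bounds.
    # bool is a subclass of int in Python, so exclude it explicitly.
    quantity = payload.get("quantity")
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        problems.append("quantity is not an integer")
    elif not 0 < quantity <= max_quantity:
        problems.append("quantity out of range")

    # Sanity check: flags must be real booleans, not strings like "yes"
    if not isinstance(payload.get("express"), bool):
        problems.append("express is not a boolean")

    return problems


good = validate_order_fields({"product_id": "SKU-1001", "quantity": 5, "express": False})
bad = validate_order_fields({"product_id": "SKU-9999", "quantity": -3, "express": "yes"})
```

Returning a list of problems instead of a single boolean keeps the incident log useful: the reason for a rejection is recorded alongside the rejected output.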

3. Multi-model fallback and A/B evaluation

With 25+ LLMs (Claude, GPT-4, Llama, Mistral, and others) I use fallback routing: if the primary model fails or outputs junk, the request is retried on a backup. For critical operations, I run real-world A/B tests—same prompt, two LLMs, compare results via checksum or semantic diff.

Model	Avg. error rate (%)	Avg. response time (s)
Claude 3 Opus	4.3	2.8
GPT-4 Turbo	7.1	3.2
Llama 2-70B	9.5	1.7

I persist a local model leaderboard in Supabase so I can quickly switch primary and backup models when these numbers shift.
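
Fallback routing and checksum-based A/B comparison can be sketched as follows. This is a simplified illustration, not the actual router: models are passed in as plain callables, and the names `call_with_fallback`, `outputs_match`, and the stub models are hypothetical. Real API clients, retry limits, and semantic diffing are left out.

```python
import hashlib
from typing import Callable, Optional, Sequence, Tuple

ModelFn = Callable[[str], str]  # takes a prompt, returns raw model text


def call_with_fallback(
    prompt: str,
    models: Sequence[Tuple[str, ModelFn]],
    is_acceptable: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try models in order; return (model_name, output) for the first acceptable output."""
    for name, model in models:
        try:
            output = model(prompt)
        except Exception:
            # Timeouts, HTTP errors, etc.: move on to the next model
            continue
        if is_acceptable(output):
            return name, output
    return None, None  # every model failed; the caller should raise an incident


def outputs_match(a: str, b: str) -> bool:
    """Checksum comparison for A/B runs: equal after whitespace and case normalization."""
    def digest(text: str) -> str:
        normalized = " ".join(text.split()).lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest(a) == digest(b)


# Stub models standing in for real API clients
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def backup(prompt: str) -> str:
    return '{"order_id": 42}'


winner, result = call_with_fallback(
    "Extract the order id",
    [("primary", flaky_primary), ("backup", backup)],
    is_acceptable=lambda text: text.strip().startswith("{"),
)
```

A checksum only detects outputs that are identical after normalization. Two answers that differ in wording but agree in meaning need the semantic diff instead.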

n8n: Where brittle LLM output kills your flows

n8n is my go-to workflow orchestrator. I integrate LLM agents via custom nodes that validate output before passing it downstream. If output fails a check, the flow halts, logs an incident, and pings a Slack channel.


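// Helper used inside a custom n8n node to validate LLM JSON output
export function validateLLMOutput(output: string): boolean {
  try {
    const data: unknown = JSON.parse(output);
    // Reject null, primitives, and arrays: only a JSON object can carry the required keys
    if (typeof data !== 'object' || data === null || Array.isArray(data)) return false;
    // Only accept if required keys exist
    return Object.prototype.hasOwnProperty.call(data, 'order_id');
  } catch (e) {
    // JSON.parse throws on malformed input
    return false;
  }
}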

In an industrial automation rollout, this runtime guard dropped faulty transactions from 12 per month to just 1 after two weeks.

Access control and traceability—non-negotiable for regulated workflows

For any process touching personal or sensitive data (GDPR/DSGVO), every LLM call passes through an audit layer: who, when, what input, what output. For sensitive fields, I mask data before logging to Supabase.

Each request gets a unique trace_id, so I can reconstruct the full chain for any incident.
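
A minimal sketch of masking plus trace IDs, assuming regex-based masking of e-mail addresses and unspaced IBANs. The patterns, placeholder strings, and the `build_audit_record` helper are hypothetical and deliberately simplified. Real PII detection for GDPR/DSGVO workloads needs much broader coverage.

```python
import re
import uuid
from datetime import datetime, timezone
from typing import Optional

# Simplified patterns for illustration only
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")  # unspaced IBANs only


def mask_sensitive(text: str) -> str:
    """Replace e-mail addresses and IBAN-like strings with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return IBAN_RE.sub("[IBAN]", text)


def build_audit_record(
    user_id: str, prompt: str, response: str, trace_id: Optional[str] = None
) -> dict:
    """Build a log row with masked text and a trace_id shared across the request chain."""
    return {
        # Reuse the caller's trace_id so every step of one request can be joined later
        "trace_id": trace_id or str(uuid.uuid4()),
        "user_id": user_id,
        "prompt": mask_sensitive(prompt),
        "response": mask_sensitive(response),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_audit_record(
    "user-17",
    "Refund max.mustermann@example.de to DE89370400440532013000",
    "Refund scheduled.",
)
```

Generating the trace_id at the entry point and passing it to every downstream call is what makes the full chain reconstructable. Each step then logs under the same ID instead of minting its own.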

FAQ

Why use so many LLMs—can’t one do the job?

No single LLM covers all real B2B use cases: languages, formats, latency, and regulatory needs vary. Multi-model setups add resilience and flexibility.

How do you detect hallucinations without manual review?

I run layered auto-checks: syntax, semantic whitelist validation, and output diffing against known-good values. Manual audit is for edge cases only.

How long does it take to build this control plane?

The first MVP (Supabase logging + basic validation) is a 2–3 day job. Expansion is iterative as new models and flows are added.

Can you do this without Supabase?

Yes—self-hosted Postgres and REST APIs work, but Supabase gives rapid dashboards, auth, and webhooks out of the box.

How do you prevent data leaks via LLMs?

I mask sensitive fields, restrict allowed inputs via pre-checks, and log a full audit trail for every LLM call. For fintech, I run models in isolated environments.

At which stage in your LLM pipeline do most production issues surface—runtime validation, integration tests, or manual review? I’d genuinely like to know.

I run a free 30-min stack audit for DACH founders building AI in regulated markets. DM me on LinkedIn or write to @ger_dennis_ai.