How to Reduce Failures and Improve Reliability of AI Agents in Production: Control Plane for 25+ LLMs with Hallucination Detection
I’m Denis Shokhirev, Enterprise AI architect based in Erlangen, Germany. At DennisCraft AI Studio I ship production AI systems for DACH B2B clients—logistics, fintech, industrial automation—using a stack built around Claude, Supabase, n8n, Doppler, and self-hosted Postgres. The most common pain after launch: unpredictable agent failures and invisible LLM hallucinations that disrupt real-world business flows. This is not theory—these are patterns I see every week in deployed systems.
Production pains: Why LLM agents fail at scale
When you orchestrate 25+ LLMs across multiple B2B workflows, failures cluster around a few repeat offenders:
- LLMs return inconsistent answers on near-identical queries
- Integration crashes: timeouts, nulls, malformed JSON
- Hallucinated content—fictional products, invalid SQL, security risks
- Brittle n8n pipelines that break on unexpected LLM output
In three recent agent rollouts, I logged 10–15% of flows breaking due to malformed or hallucinated LLM responses—not infra bugs, but model artifacts. The longer your pipelines, the more hidden these issues become.
Control plane architecture: How I keep agents stable
1. Centralized LLM call logging with Supabase
Every LLM call (prompt, params, response, status, latency) gets logged in a dedicated Postgres table via Supabase. This powers dashboards for error rates, outlier detection, and latency spikes.
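    import os
    from datetime import datetime, timezone

    from supabase import create_client

    # Assumption: credentials are read from the SUPABASE_URL / SUPABASE_KEY env vars.
    client = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

    def log_llm_call(user_id, prompt, response, status):
        """Insert one LLM call record into the llm_logs table."""
        data = {
            "user_id": user_id,
            "prompt": prompt,
            "response": response,
            "status": status,
            # ISO 8601 string: a raw datetime object is not JSON-serializable
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
        client.table("llm_logs").insert(data).execute()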
In a fintech automation deployment, this surfaced a 7% unstable response rate in the first week—root cause was only visible by mining these logs.
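The paragraph on logging also mentions latency and error rates, which the snippet above does not capture. A minimal sketch of how both could be measured, assuming a hypothetical `call_model` callable and a `sink` callback (for example a function that forwards the record to the `llm_logs` table); neither name comes from the original stack:

```python
import time
from typing import Any, Callable, Dict, List


def timed_llm_call(
    call_model: Callable[[str], str],
    prompt: str,
    sink: Callable[[Dict[str, Any]], None],
) -> str:
    """Call the model, measure latency, and hand one log record to `sink`."""
    start = time.perf_counter()
    status, response = "ok", ""
    try:
        response = call_model(prompt)
        return response
    except Exception as exc:
        # Record the failure type, then let the caller decide what to do
        status = f"error: {type(exc).__name__}"
        raise
    finally:
        latency_ms = (time.perf_counter() - start) * 1000
        sink({
            "prompt": prompt,
            "response": response,
            "status": status,
            "latency_ms": round(latency_ms, 1),
        })


def error_rate(records: List[Dict[str, Any]]) -> float:
    """Share of records whose status is not 'ok' (0.0 for an empty list)."""
    if not records:
        return 0.0
    failed = sum(1 for r in records if r["status"] != "ok")
    return failed / len(records)
```

Because the record is written in the `finally` branch, failed calls are logged as well, which is what makes an error-rate figure meaningful.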
2. Real-time hallucination detection
LLMs will generate plausible but unsafe output: non-existent SKUs, invalid code, or dangerous commands. I run runtime validation on every agent output:
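- SQL parsing (with sqlparse) to reject output that is not a recognizable SQL statement before DB execution; sqlparse is a non-validating parser, so it is a first gate rather than a full syntax check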
- Semantic checks against whitelists—e.g., known product IDs
- Sanity checks for numerical and boolean fields
For database-driven flows, a pattern like this blocks most hallucinated queries:
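    import sqlparse

    def is_valid_sql(query):
        # sqlparse is non-validating and accepts almost any text, so also
        # require every statement to have a recognized type (SELECT, INSERT, ...).
        try:
            parsed = sqlparse.parse(query.strip())
        except Exception:
            return False
        return len(parsed) > 0 and all(s.get_type() != "UNKNOWN" for s in parsed)

    def check_output(output):
        if not is_valid_sql(output):
            return False
        # Add semantic validation here
        return True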
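The `# Add semantic validation here` placeholder is where the whitelist and sanity checks from the list above would go. A minimal sketch for a structured order output, assuming hypothetical field names (`product_id`, `quantity`, `express`), an illustrative quantity range, and a hard-coded whitelist that would normally be loaded from the product database:

```python
from typing import Any, Dict, List, Set

# Hypothetical whitelist for illustration only.
KNOWN_PRODUCT_IDS: Set[str] = {"SKU-1001", "SKU-1002", "SKU-1003"}


def validate_order(
    output: Dict[str, Any],
    known_ids: Set[str] = KNOWN_PRODUCT_IDS,
) -> List[str]:
    """Return a list of problems; an empty list means the output passed."""
    problems = []

    # Semantic check: the product must exist in the whitelist
    product_id = output.get("product_id")
    if not isinstance(product_id, str) or product_id not in known_ids:
        problems.append(f"unknown product_id: {product_id!r}")

    # Numeric sanity check; bool is a subclass of int, so exclude it explicitly
    quantity = output.get("quantity")
    if (
        not isinstance(quantity, int)
        or isinstance(quantity, bool)
        or not 1 <= quantity <= 10_000
    ):
        problems.append(f"quantity out of range: {quantity!r}")

    # Boolean sanity check: reject strings such as "yes"
    if not isinstance(output.get("express"), bool):
        problems.append("express must be true or false")

    return problems
```

Returning the list of problems instead of a bare boolean keeps the reason for each rejection available for the log table.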
3. Multi-model fallback and A/B evaluation
With 25+ LLMs (Claude, GPT-4, Llama, Mistral, and others) I use fallback routing: if the primary model fails or outputs junk, the request is retried on a backup. For critical operations, I run real-world A/B tests—same prompt, two LLMs, compare results via checksum or semantic diff.
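Fallback routing of this kind could be sketched as follows. This is an illustration under assumptions, not the production router: each model is represented by a hypothetical callable that takes a prompt and returns text, and `is_valid` stands in for whatever output check the flow uses.

```python
from typing import Callable, List, Optional, Tuple

ModelFn = Callable[[str], str]


def call_with_fallback(
    prompt: str,
    models: List[Tuple[str, ModelFn]],
    is_valid: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try models in order; return (model_name, output, failures)."""
    failures: List[str] = []
    for name, call_model in models:
        try:
            output = call_model(prompt)
        except Exception as exc:
            # Timeouts, API errors and the like: move on to the next model
            failures.append(f"{name}: {type(exc).__name__}")
            continue
        if is_valid(output):
            return name, output, failures
        failures.append(f"{name}: invalid output")
    # Every model failed or produced junk
    return None, None, failures
```

The `failures` list records why each earlier model was skipped, so a fallback that fires often shows up in the logs instead of being hidden by the successful retry.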
| Model | Avg. error rate (%) | Avg. response time (s) |
|---|---|---|
| Claude 3 Opus | 4.3 | 2.8 |
| GPT-4 Turbo | 7.1 | 3.2 |
| Llama 2-70B | 9.5 | 1.7 |
I persist a local model leaderboard in Supabase for quick pivoting.
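For the checksum side of the A/B comparison, one possible sketch is to normalize both outputs before hashing, so that key order and whitespace do not count as disagreement. The normalization rules here are assumptions, and this only detects exact matches after normalization; a semantic diff would need a separate approach.

```python
import hashlib
import json


def canonical_checksum(raw_output: str) -> str:
    """Hash an output after normalizing it."""
    try:
        # JSON: re-serialize with sorted keys and no extra whitespace
        normalized = json.dumps(
            json.loads(raw_output), sort_keys=True, separators=(",", ":")
        )
    except json.JSONDecodeError:
        # Plain text: collapse whitespace and ignore case
        normalized = " ".join(raw_output.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def outputs_agree(output_a: str, output_b: str) -> bool:
    """True when both model outputs normalize to the same checksum."""
    return canonical_checksum(output_a) == canonical_checksum(output_b)
```

Disagreements flagged this way can then be counted per model pair and written to the leaderboard table.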
n8n: Where brittle LLM output kills your flows
n8n is my go-to workflow orchestrator. I integrate LLM agents via custom nodes that validate output before passing it downstream. If output fails a check, the flow halts, logs an incident, and pings a Slack channel.
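    // Helper for a custom n8n node: validates LLM JSON output
    export function validateLLMOutput(output: string): boolean {
        try {
            const data = JSON.parse(output);
            // Reject null, arrays, and primitives: only a JSON object is acceptable
            if (data === null || typeof data !== 'object' || Array.isArray(data)) return false;
            // Only accept if required keys exist
            return Object.prototype.hasOwnProperty.call(data, 'order_id');
        } catch (e) {
            // Malformed JSON
            return false;
        }
    }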
In an industrial automation rollout, this runtime guard dropped faulty transactions from 12 per month to just 1 after two weeks.
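The halt, log, and alert sequence described for the n8n node can also be expressed outside n8n. A rough Python sketch, assuming hypothetical `log_incident` and `notify` callbacks (for instance a database insert and a Slack webhook call, neither shown here):

```python
import json
from typing import Any, Callable, Dict, Iterable


class ValidationHalt(Exception):
    """Raised to stop a flow when LLM output fails validation."""


def guard_llm_output(
    raw: str,
    required_keys: Iterable[str],
    log_incident: Callable[[Dict[str, Any]], None],
    notify: Callable[[str], None],
) -> Dict[str, Any]:
    """Return parsed output, or log, alert, and halt if it is unusable."""
    required = list(required_keys)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None

    if isinstance(data, dict):
        missing = [key for key in required if key not in data]
    else:
        missing = required  # not a JSON object at all

    if not isinstance(data, dict) or missing:
        incident = {
            "reason": "invalid LLM output",
            "missing_keys": missing,
            "raw": raw[:200],  # truncate to keep the incident log small
        }
        log_incident(incident)
        notify(f"LLM output rejected, missing keys: {missing}")
        raise ValidationHalt(incident["reason"])
    return data
```

Raising an exception rather than returning `False` forces the caller to stop, which mirrors the flow halting instead of passing bad data downstream.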
Access control and traceability—non-negotiable for regulated workflows
For any process touching personal or sensitive data (GDPR/DSGVO), every LLM call passes through an audit layer: who, when, what input, what output. For sensitive fields, I mask data before logging to Supabase.
Each request gets a unique trace_id, so I can reconstruct the full chain for any incident.
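A minimal sketch of what masking plus a `trace_id` could look like before a record is written. The two regular expressions are simplified assumptions for illustration; real PII detection needs broader coverage than email addresses and IBAN-like strings, and the record fields are hypothetical.

```python
import re
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Simplified illustrative patterns, not a complete PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def mask_text(text: str) -> str:
    """Replace email addresses and IBAN-like strings with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return IBAN_RE.sub("[IBAN]", text)


def build_audit_record(
    user_id: str,
    prompt: str,
    response: str,
    trace_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a masked audit record: who, when, what input, what output."""
    return {
        # Reuse the caller's trace_id so all steps of one request share it
        "trace_id": trace_id or str(uuid.uuid4()),
        "user_id": user_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "prompt": mask_text(prompt),
        "response": mask_text(response),
    }
```

Passing the same `trace_id` to every step of a request is what allows the full chain to be reconstructed later with a single filter on that column.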
FAQ
Why use so many LLMs—can’t one do the job?
No single LLM covers all real B2B use cases: languages, formats, latency, and regulatory needs vary. Multi-model setups add resilience and flexibility.
How do you detect hallucinations without manual review?
I run layered auto-checks: syntax, semantic whitelist validation, and output diffing against known-good values. Manual audit is for edge cases only.
How long does it take to build this control plane?
The first MVP (Supabase logging + basic validation) is a 2–3 day job. Expansion is iterative as new models and flows are added.
Can you do this without Supabase?
Yes—self-hosted Postgres and REST APIs work, but Supabase gives rapid dashboards, auth, and webhooks out of the box.
How do you prevent data leaks via LLMs?
I mask sensitive fields, restrict allowed inputs via pre-checks, and log a full audit trail for every LLM call. For fintech, I run models in isolated environments.
At which stage in your LLM pipeline do most production issues surface—runtime validation, integration tests, or manual review? I’d genuinely like to know.
I run a free 30-min stack audit for DACH founders building AI in regulated markets. DM me on LinkedIn or write to @ger_dennis_ai.