How We Built a 20-Lane Supervisor That Triages Operational Signals With Deterministic Heuristics Before Any AI Touches Them
We run 25 microservices on a single VPS. Each service produces operational signals — failed systemd units, stuck pipeline leads, stale research artifacts, model evaluation regressions. Somebody has to watch all of it.
We tried the obvious thing first: let an AI model read the system state and tell us what needs attention. It worked great for about a week, then started hallucinating incidents, re-opening resolved issues, and escalating items that a human had already triaged. The model couldn't know what we'd already handled because it had no memory of our decisions.
So we built a lane-based supervisor that inverts the usual pattern. Deterministic heuristics run first — no model involved. An optional AI refinement pass runs second, bounded and guarded. And a layer of lane-specific guards sits between the model's output and the operator's review queue, suppressing anything the deterministic layer already resolved.
This post walks through the architecture, the guard pattern that makes it work, and the alias-matching bug that taught us why key normalization in suppression logic is harder than it looks.
Architecture Overview
The supervisor splits operational concerns into 20 independent lanes:
┌──────────────────────────────────────────────────┐
│ DIGEST (aggregator) │
├──────────┬──────────┬──────────┬────────────────┤
│ ops │ leads │ evals │ revenue ... │
│ (10m) │ (15m) │ (30m) │ (30m) │
├──────────┴──────────┴──────────┴────────────────┤
│ 19 worker lanes + 1 digest lane │
│ Each: own timer, own state, own review queue │
└──────────────────────────────────────────────────┘
Nineteen worker lanes — ops, recon, schedule, leads, revenue, evals, training, claims, knowledge, sessions, github, and eight more — each run on independent systemd timers at cadences from 10 minutes to 60 minutes. One digest lane aggregates findings across all workers.
Each lane follows the same pipeline:
heuristic baseline (deterministic, no model)
↓
model refinement (optional, Gemini with local fallback)
↓
lane guards (domain-specific suppression)
↓
review-queue.json (operator-facing output)
The critical design decision: the heuristic baseline must produce a useful result on its own. The model refinement is an optimizer, not a requirement. If every model backend fails, the heuristic output still lands in the review queue.
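That heuristic-first flow can be sketched as a single driver function. This is a minimal illustration, not our actual code; `run_heuristic`, `refine_with_model`, and `apply_lane_guards` stand in for the lane-specific pieces described later:

```python
def run_lane(lane, context, run_heuristic, refine_with_model, apply_lane_guards):
    baseline = run_heuristic(lane, context)                  # deterministic, always runs
    try:
        result = refine_with_model(lane, context, baseline)  # optional optimizer
    except Exception:
        result = baseline                                    # model outage degrades, never breaks
    # Guards always run, regardless of which layer produced the items.
    return apply_lane_guards(lane, result["items"], context)

# Minimal stand-ins to exercise the flow:
def heuristic(lane, ctx):
    return {"items": [{"key": "unit:failed", "label": "review"}]}

def failing_model(lane, ctx, baseline):
    raise RuntimeError("backend down")

def passthrough_guard(lane, items, ctx):
    return items

items = run_lane("ops", {}, heuristic, failing_model, passthrough_guard)
# items is the heuristic baseline, even though every model call failed
```

The point of the shape: the `try` wraps only the refinement, so a model failure is indistinguishable from opting out of refinement entirely.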
Implementation: The Heuristic-First Pipeline
Pipeline Analytics as a Heuristic Source
Each lane has its own heuristic logic. Here's a real example — the revenue lane's pipeline analytics script, which computes CRM funnel health deterministically:
const FUNNEL_ORDER = [
'drafted', 'new', 'outreach_sent', 'sent',
'responded', 'qualified', 'meeting', 'won',
'deprioritized',
];
const STUCK_STATUSES = new Set(['outreach_sent', 'responded', 'qualified']);
const STUCK_DAYS = 7;
The stuck-lead detector scans every contact in the CRM, computes the last meaningful activity date, and flags anything idle for more than seven days:
function getLatestInternalActivityDate(contact, followUpDrafts) {
const crmAction = isIsoDate(contact?.last_action)
? contact.last_action : '';
const byId = followUpDrafts.byId.get(String(contact?.id || ''));
const companyKey = String(contact?.company || '').trim().toLowerCase();
const byCompany = companyKey
? followUpDrafts.byCompany.get(companyKey) : '';
return laterIsoDate(crmAction, laterIsoDate(byId, byCompany));
}
Notice what this does: it doesn't just check the CRM's last_action field. It also indexes follow-up draft artifacts on disk — files sitting in outreach/follow-ups/YYYY-MM-DD/ directories — and treats those as internal activity. A contact with a queued follow-up draft isn't stuck, even if the CRM hasn't been updated yet.
This detail matters because the revenue lane's supervisor reads the structured summary this script produces. If the summary says "3 stuck leads," the supervisor opens a pipeline:stuck-leads review item. We learned the hard way that ignoring draft artifacts caused false positives — leads with review-ready follow-ups on disk kept appearing as stuck because the CRM last_action hadn't been updated.
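The stuck check itself reduces to a status filter plus a date comparison. A Python sketch of the classification (the constants and status names mirror the script above; the helper name and ISO handling are ours):

```python
from datetime import datetime, timedelta, timezone

STUCK_STATUSES = {"outreach_sent", "responded", "qualified"}
STUCK_DAYS = 7

def is_stuck(status, last_activity_iso, now=None):
    """A lead is stuck if it sits in a mid-funnel status with no
    internal activity (CRM action or draft artifact) for 7+ days."""
    if status not in STUCK_STATUSES:
        return False
    if not last_activity_iso:
        return True  # no recorded activity at all
    now = now or datetime.now(timezone.utc)
    last = datetime.fromisoformat(last_activity_iso.replace("Z", "+00:00"))
    return (now - last) > timedelta(days=STUCK_DAYS)

now = datetime(2026, 3, 28, tzinfo=timezone.utc)
print(is_stuck("qualified", "2026-03-10T00:00:00Z", now))  # True: 18 days idle
print(is_stuck("qualified", "2026-03-25T00:00:00Z", now))  # False: recent draft counts
print(is_stuck("won", "2026-01-01T00:00:00Z", now))        # False: terminal status
```

Because `last_activity_iso` is the merged date from `getLatestInternalActivityDate`, a fresh draft on disk moves the lead out of the stuck set even when the CRM field is stale.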
The output is entirely deterministic:
const summary = {
generated_at: new Date().toISOString(),
total_contacts: contacts.length,
funnel_counts: funnelCounts,
conversion_rates: conversions,
stuck_leads: {
count: stuckLeads.length,
threshold_days: STUCK_DAYS,
leads: stuckLeads,
},
weekly_velocity: velocity,
alerts: alerts,
health: alerts.length === 0 ? 'healthy' : 'needs_attention',
};
No model was involved. No inference latency. The revenue lane has a ground-truth snapshot it can act on immediately.
Model Refinement: Bounded and Conservative
When a lane opts into AI refinement, the model receives the heuristic baseline as context and can adjust item wording, reprioritize, or surface items the heuristic missed. But the prompt is deliberately constraining:
def build_prompt(lane, context, heuristic):
compact_payload = {
"lane": lane,
"heuristic_baseline": heuristic,
"context": compact_context(context),
}
return (
"You are refining a deterministic supervisor lane result.\n"
"Use ONLY facts from the provided JSON payload. "
"Do not invent artifacts, companies, incidents, or file paths.\n"
"This is an unattended operator monitor, so be conservative. "
"If uncertain, keep or lower severity; do not speculate.\n"
'Return ONLY valid JSON in this exact shape: '
'{"lane":"%s","summary":"...","items":[{"source":"...",'
'"key":"...","label":"review|escalate",'
'"reason":"...","evidence":["..."]}]}\n'
"Rules:\n"
"- Keep at most 12 items.\n"
"- Only include items that deserve operator review.\n"
"- Evidence strings must be short and directly grounded "
"in the payload.\n"
) % (lane, json.dumps(compact_payload, ensure_ascii=True, indent=2))
Three constraints worth highlighting:
- Max 12 items. The model can't dump 50 issues into the queue.
- Only review or escalate labels. No invented severity levels.
- Evidence must be grounded in the payload. The model can't cite data it wasn't given.
The context itself is aggressively compacted before reaching the model — dictionaries capped at 16 keys, lists at 8 items, strings at 280 characters, recursion limited to 4 levels:
def compact_context(value, depth=0):
if depth >= DEFAULT_MAX_CONTEXT_DEPTH:
return clean_text(value, DEFAULT_MAX_CONTEXT_CHARS)
if isinstance(value, dict):
compacted = {}
for index, (key, child) in enumerate(value.items()):
if index >= DEFAULT_MAX_CONTEXT_DICT:
compacted["__truncated_keys__"] = (
len(value) - DEFAULT_MAX_CONTEXT_DICT
)
break
compacted[str(key)] = compact_context(child, depth + 1)
return compacted
# ... lists, strings, primitives similarly bounded
This isn't just about token costs. It's about limiting the model's surface area for hallucination. The less context it sees, the less it can invent.
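Filling in the elided branches, a complete version of the bounding under the limits stated above might look like this; the constant values come from the text, the truncation-marker strings are our assumption:

```python
DEFAULT_MAX_CONTEXT_DEPTH = 4
DEFAULT_MAX_CONTEXT_DICT = 16
DEFAULT_MAX_CONTEXT_LIST = 8
DEFAULT_MAX_CONTEXT_CHARS = 280

def clean_text(value, limit):
    return str(value)[:limit]

def compact_context(value, depth=0):
    if depth >= DEFAULT_MAX_CONTEXT_DEPTH:
        return clean_text(value, DEFAULT_MAX_CONTEXT_CHARS)
    if isinstance(value, dict):
        compacted = {}
        for index, (key, child) in enumerate(value.items()):
            if index >= DEFAULT_MAX_CONTEXT_DICT:
                compacted["__truncated_keys__"] = len(value) - DEFAULT_MAX_CONTEXT_DICT
                break
            compacted[str(key)] = compact_context(child, depth + 1)
        return compacted
    if isinstance(value, list):
        compacted = [compact_context(v, depth + 1)
                     for v in value[:DEFAULT_MAX_CONTEXT_LIST]]
        if len(value) > DEFAULT_MAX_CONTEXT_LIST:
            compacted.append(
                f"__truncated_{len(value) - DEFAULT_MAX_CONTEXT_LIST}_items__"
            )
        return compacted
    if isinstance(value, str):
        return clean_text(value, DEFAULT_MAX_CONTEXT_CHARS)
    return value  # numbers, bools, None pass through unchanged

big = {"log": ["x" * 500] + list(range(20))}
out = compact_context(big)
# out["log"] holds 8 bounded entries plus one truncation marker
```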
The Fallback Chain
The refinement step runs Gemini as the primary backend with a 120-second timeout. If that fails — auth expiry, rate limit, network issue — it falls back to a local Ollama instance running Qwen 2.5 (0.5B) with a 45-second timeout:
def run_refinement(prompt, lane, args, context):
primary_backend = "gemini-cli"
errors = []
try:
stdout, stderr = run_gemini(
prompt, args.gemini_bin, args.model, args.timeout_ms
)
parsed = validate_parsed(
lane, extract_json_payload(stdout), context=context,
current_backend=primary_backend,
current_model=args.model,
primary_backend=primary_backend,
primary_model=args.model,
)
return {"parsed": parsed, "raw_output": stdout.strip()}
except Exception as exc:
errors.append(str(exc))
fallback_backend = normalized_fallback_backend(args.fallback_backend)
if fallback_backend == "ollama":
try:
stdout, _ = run_ollama(
prompt, args.qwen_model,
args.qwen_base_url, args.qwen_timeout_ms
)
parsed = validate_parsed(
lane, extract_json_payload(stdout), context=context,
current_backend="ollama",
current_model=args.qwen_model,
primary_backend=primary_backend,
primary_model=args.model,
fallback_backend="ollama",
fallback_model=args.qwen_model,
fallback_reason=errors[-1] if errors else None,
)
return {"parsed": parsed, "raw_output": stdout.strip()}
except Exception as exc:
errors.append(str(exc))
raise RuntimeError(" | ".join(errors))
The metadata tracks which backend actually produced the result — _fallback_model_used, _fallback_reason — so we know when the primary is degraded without having to check logs.
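With those fields on every result, spotting a degraded primary is a one-line scan over lane outputs. A sketch (the field names come from the text above; the scanning function is ours):

```python
def degraded_lanes(results):
    """Return lanes whose latest result came from the fallback backend."""
    return [r["lane"] for r in results if r.get("_fallback_model_used")]

results = [
    {"lane": "ops", "_fallback_model_used": False},
    {"lane": "evals", "_fallback_model_used": True,
     "_fallback_reason": "gemini-cli: timeout"},
]
print(degraded_lanes(results))  # ['evals']
```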
The Lane Guard Pattern: Where the Real Work Lives
Here's the problem the guards solve: the AI model doesn't know what you've already decided about an item. It sees an eval case with blockedRecheckCount: 1 and thinks "interesting, let me escalate this." But that count means a human already reviewed the item and explicitly blocked it for recheck. The model is re-opening a closed decision.
Lane guards sit between model output and the review queue. They're pure Python functions that apply domain-specific suppression rules:
def apply_lane_guards(lane, items, context):
if lane == "knowledge":
security_lead_states = knowledge_security_lead_state_map(context)
guarded = []
for item in items:
state = security_lead_states.get(
clean_text(item.get("key"), 160).lower()
)
if (isinstance(state, dict)
and clean_text(state.get("stage"), 40).lower()
== "resolved"):
continue # suppress resolved items
guarded.append(item)
return guarded
The knowledge lane guard is simple: if a security lead's stage is resolved, don't let the model resurface it. But the evals and training lane guards are where things get interesting — and where we hit the bug.
Multi-Condition Guard Logic
The evals and training lanes share a guard with six suppression rules that execute in sequence. The abridged core looks like this:
if lane not in {"evals", "training"}:
return items
expected_labels = eval_expected_label_map(context)
experiment_scoreboard = eval_experiment_scoreboard_map(context)
latest_experiments = eval_latest_experiment_log_map(context)
latest_selectors = eval_latest_prospecting_selector_map(context)
guarded = []
seen = set()
for item in items:
    adjusted = dict(item)
    expected_label = expected_labels.get(adjusted.get("key"))
    # selector_tuple and latest_experiment are per-item lookups into the
    # scoreboard and experiment-log maps built above (lookup code elided)
# Rule 1: Don't let model re-upgrade a "good" artifact
if expected_label == "good":
continue
# Rule 2: Suppress stale selector history
if (adjusted.get("label") == "review"
and isinstance(selector_tuple, dict)):
latest_selector = latest_selectors.get(
selector_tuple["reportType"]
)
if isinstance(latest_selector, dict):
if (latest_selector.get("provider")
!= selector_tuple["provider"]
or latest_selector.get("model")
!= selector_tuple["model"]):
continue
# Rule 3: Suppress experiment-log items superseded
# by a later good/bad outcome
if (adjusted.get("label") == "review"
and isinstance(latest_experiment, dict)):
if (latest_experiment.get("resultLabel")
and latest_experiment.get("resultLabel") != "review"):
continue
# Rule 4: Downgrade escalations when expected label
# is "review" — respect human gating
if (adjusted.get("label") == "escalate"
and expected_label == "review"):
adjusted["label"] = "review"
# Rule 5: Deduplicate by (source, key, label)
dedupe_key = (
clean_text(adjusted.get("source"), 160),
clean_text(adjusted.get("key"), 160),
clean_text(adjusted.get("label"), 32),
)
if dedupe_key in seen:
continue
seen.add(dedupe_key)
guarded.append(adjusted)
return guarded
Each rule addresses a specific failure mode we observed in production:
- Rule 1 stops the model from re-upgrading items a human marked as good
- Rule 2 suppresses historical data about model selectors we've since replaced
- Rule 3 prevents old experiment results from re-opening after a newer run superseded them
- Rule 4 ensures the model can't escalate something a human deliberately left at review severity
- Rule 5 catches the model emitting the same item twice with slightly different wording
What Surprised Us
The Alias Bug That Took Two Days to Find
Our eval cases carry full artifact IDs like prospecting:2026-03-28-security-audit-leads.json. The training rows carry the same data. But the model — when referencing these items — would emit a shortened key: 2026-03-28-security-audit-leads.
The original guard logic only indexed items by their full ID. So when the model emitted the short key, the guard couldn't find the item's expectedLabel, the suppression didn't fire, and the supervisor kept re-opening March training review items that had already been handled.
The fix was a key expansion in the expected-label index:
def eval_expected_label_map(context):
mapping = {}
for field_name in ("eval_cases", "training_rows"):
for record in context.get(field_name, []):
key = clean_text(record.get("id"), 160)
label = clean_text(
record.get("expectedLabel"), 32
).lower()
aliases = [key]
if key.startswith("prospecting:"):
short_key = key[len("prospecting:"):]
aliases.append(short_key)
if short_key.endswith(".json"):
aliases.append(short_key[:-5])
for alias in aliases:
if alias and alias not in mapping:
mapping[alias] = label
return mapping
Three aliases for every prospecting key: the full ID, the prefix-stripped version, and the extension-stripped version. This also applied to the training lane — the original guard only ran on evals, which meant the training lane had no protection against the same aliasing problem.
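Pulled out of the map-building loop, the alias expansion is easy to test in isolation. A self-contained version of the same logic:

```python
def expand_aliases(key):
    """Full ID, prefix-stripped, and extension-stripped variants."""
    aliases = [key]
    if key.startswith("prospecting:"):
        short_key = key[len("prospecting:"):]
        aliases.append(short_key)
        if short_key.endswith(".json"):
            aliases.append(short_key[:-5])
    return aliases

full_id = "prospecting:2026-03-28-security-audit-leads.json"
print(expand_aliases(full_id))
# ['prospecting:2026-03-28-security-audit-leads.json',
#  '2026-03-28-security-audit-leads.json',
#  '2026-03-28-security-audit-leads']
```

Whichever of the three forms the model emits, the expected-label lookup now resolves, so the suppression fires.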
The "Activity" Definition Mismatch
The revenue lane's stuck-lead detection originally only checked the CRM's last_action date. But our workflow generates follow-up draft artifacts before they're actually sent — files sitting in dated directories on disk. A contact with a draft queued for review tomorrow isn't stuck. But the CRM doesn't know about drafts.
We had to build a follow-up draft index that scans date-organized directories, extracts contact IDs and company names from markdown frontmatter, and merges those dates into the activity calculation. Without it, the revenue lane was generating 3-5 false stuck-lead alerts per day.
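A sketch of such an index, assuming `YYYY-MM-DD` subdirectories of markdown drafts whose frontmatter carries a `contact_id:` line. The directory layout comes from the text; the frontmatter field name and parsing details are our assumption:

```python
import re
from pathlib import Path

DATE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def index_follow_up_drafts(root):
    """Map contact IDs to the latest draft date found on disk."""
    by_id = {}
    for day_dir in Path(root).iterdir():
        if not (day_dir.is_dir() and DATE_DIR.match(day_dir.name)):
            continue
        for draft in day_dir.glob("*.md"):
            match = re.search(r"^contact_id:\s*(\S+)",
                              draft.read_text(), re.MULTILINE)
            if not match:
                continue
            cid = match.group(1)
            # Keep the most recent draft date per contact (ISO dates sort lexically).
            if day_dir.name > by_id.get(cid, ""):
                by_id[cid] = day_dir.name
    return by_id
```

The resulting dates feed straight into the activity merge, so a queued draft counts as internal activity even before the CRM is touched.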
Model Output Isn't Always JSON
Even with explicit "return ONLY valid JSON" instructions, models sometimes wrap output in markdown fences, prepend explanatory text, or emit arrays instead of objects. Our JSON extraction tries four parsing strategies in order:
def extract_json_payload(text):
    trimmed = text.strip()
    candidates = [trimmed]
# Try markdown-fenced JSON
fenced = re.search(
r"```(?:json)?\s*([\s\S]*?)```", text, re.IGNORECASE
)
if fenced:
candidates.append(fenced.group(1).strip())
# Try extracting the outermost object
object_start = trimmed.find("{")
object_end = trimmed.rfind("}")
if object_start != -1 and object_end > object_start:
candidates.append(trimmed[object_start:object_end + 1])
# Try extracting the outermost array
array_start = trimmed.find("[")
array_end = trimmed.rfind("]")
if array_start != -1 and array_end > array_start:
candidates.append(trimmed[array_start:array_end + 1])
for candidate in candidates:
try:
return json.loads(candidate)
except Exception:
continue
raise ValueError("Unable to extract valid JSON")
This isn't elegant. It's battle-tested. Every candidate is tried until one parses. The 0.5B local fallback model is particularly prone to wrapping JSON in conversational text.
Lessons Learned
1. Deterministic heuristics are your foundation, not your fallback. We started by treating heuristics as a "fallback when the model is down." We ended up inverting that. The heuristic result is the primary output. The model is an optional refinement layer. This means a model outage degrades quality slightly — it doesn't break the system.
2. Guard logic is more important than model quality. We spent a week tuning prompts to stop the model from re-opening resolved issues. Then we spent a day writing guard functions that suppress them deterministically. The guards were 10x more reliable and took a fraction of the effort.
3. Key normalization in suppression logic needs alias expansion. If your guard looks up items by key, and any upstream system can emit a shortened or variant form of that key, your guard will silently fail. Index every alias you've ever seen in production. Then add one more.
4. Track which model backend produced each result. When your primary model degrades, you need to know immediately — not by reading logs, but by seeing _fallback_model_used: true in the structured output. We track primary backend, current backend, fallback reason, and model descriptor on every single lane result.
5. Compact your context aggressively. The 16-key, 8-item, 4-depth, 280-character limits aren't just about token costs. They're about reducing the model's surface area for hallucination. The less irrelevant data the model sees, the less likely it is to invent issues from noisy context fields.
Conclusion
The supervisor processes signals from 25 services across 20 lanes, running on timers from every 10 minutes to every 60 minutes. It handles the equivalent of a junior SRE's monitoring rotation — flagging failed units, stuck pipeline leads, stale research artifacts, and model evaluation regressions — without a model being in the critical path.
The pattern is simple: compute a deterministic baseline, optionally refine with a bounded model call, then guard the output with domain-specific suppression rules. The guards encode your operational decisions as code. The model can suggest; the guards enforce.
We've been running this in production for months. The alias bug taught us that suppression logic is only as good as your key normalization. The activity-definition mismatch taught us that "last action" means different things in different parts of the system. And the JSON extraction gauntlet taught us that even the most explicit prompt won't stop a model from wrapping its output in markdown.
If you're building operational monitoring for a multi-service system and you're tempted to start with an AI model reading your logs — start with deterministic heuristics instead. Add the model later, behind guards. You'll sleep better.
Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.