When a 401 Error Became Valid Data: How Poisoned Rows Survived Two Days Undetected


Summary

On April 8, 2026, our research analysis service lost access to its primary AI provider when a Groq API key expired. Instead of failing loudly, the service treated the provider's 401 Invalid API Key error string as a valid analysis result — and wrote it into the output store alongside real data. By the time we caught it on April 9, two days of analysis output contained poisoned rows: error messages masquerading as legitimate research summaries. Downstream consumers had already ingested some of them. Detection-to-full-resolution took roughly six hours, including a manual data purge, a code fix to treat auth failures as retryable errors with local model fallback, and a controlled backfill at reduced concurrency.

Timeline

April 8, ~03:00 UTC — API key expires silently. Our Groq API key hit its rotation deadline. No alert fired because the key didn't fail on a health check — it failed on the next real request.

April 8, 03:12–04:40 UTC — Poisoned rows start accumulating. The research analyzer service continued its normal processing loop. Each analysis request returned a 401 response. The service extracted the error body — Analysis failed: 401 Invalid API Key — and stored it as the analysis result for that research item. The service reported healthy. Timers showed green.

April 8, ~14:00 UTC — Downstream digest picks up bad data. Our daily digest pipeline consumed the latest analysis output. Several entries contained the raw error string instead of structured analysis. The digest rendered them as-is — short, nonsensical "summaries" that didn't trigger length or format guards because they were within token bounds.

April 9, 08:30 UTC — Manual review catches the anomaly. During morning briefing review, we noticed multiple digest entries that were identical one-line strings. Grep confirmed the pattern across two days of output.

April 9, 09:00 UTC — Root cause identified. The analyzer's result-handling path had no distinction between a successful AI response and a caught-exception string. Both flowed through the same write path.

April 9, 09:45 UTC — Poisoned rows purged. We identified and deleted all rows containing the error signature from April 8–9.

April 9, 11:30 UTC — Fix deployed. New code treats provider auth and availability errors as retryable, falls back to a local Ollama instance, and validates response structure before persisting.

April 9, 14:00 UTC — Backfill complete. Clean re-analysis of all affected items, running at concurrency 1 against the local model to avoid overloading the VPS.

Root Cause

The failure had three layers, and all three had to be present for the poisoning to occur.

Layer 1: Error responses treated as valid output

The original analysis function caught exceptions and returned them as strings. This is the pattern that made everything downstream trust bad data:

async function analyzeItem(item) {
  try {
    const response = await groqClient.chat.completions.create({
      model: 'llama-3.3-70b-versatile',
      messages: [
        { role: 'system', content: ANALYSIS_SYSTEM_PROMPT },
        { role: 'user', content: item.content }
      ]
    });
    return response.choices[0].message.content;
  } catch (err) {
    return `Analysis failed: ${err.message}`;
  }
}

That return in the catch block is the entire root cause. The function's contract says "I return a string" — and it does, always. The caller had no way to distinguish a valid analysis from an error message. Both were strings. Both got written to the output store.
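One way to make the distinction unmissable is a discriminated return, so the caller cannot touch the data without first checking the `ok` flag. A minimal sketch of the pattern (not the code we shipped — `analyzeItemSafe` and `callProvider` are illustrative names):

```javascript
// Success and failure share a shape but differ on the `ok` discriminant,
// so a caller must branch before it can reach `data`.
function ok(data) {
  return { ok: true, data };
}

function fail(error) {
  return { ok: false, error };
}

async function analyzeItemSafe(item, callProvider) {
  try {
    const data = await callProvider(item);
    return ok(data);
  } catch (err) {
    return fail(err.message);
  }
}

// Caller: only the `ok: true` branch may reach the write path.
// const result = await analyzeItemSafe(item, callGroq);
// if (result.ok) await writeResult(item.id, result.data);
```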

Layer 2: No structural validation on output

The persistence layer accepted any string:

const result = await analyzeItem(item);
await writeResult(item.id, {
  analysis: result,
  analyzedAt: new Date().toISOString(),
  source: 'groq'
});

There was no check for minimum length, no schema validation, no sentinel detection. A 40-character error string and a 2,000-character analysis both flowed through identically.

Layer 3: No provider health tracking

The service processed items in a loop with configurable concurrency. When every single request returned a 401, the service didn't notice the pattern. It processed 100% of items, "succeeded" on all of them (no uncaught exceptions), and reported a clean run.

const items = await getUnanalyzedItems();
const limit = pLimit(CONCURRENCY);
const results = await Promise.all(
  items.map(item => limit(() => analyzeItem(item)))
);
console.log(`Analyzed ${results.length} items`);

The log said Analyzed 47 items. It didn't lie — it did analyze 47 items. It just happened that every "analysis" was an error string.

The Fix

The fix addressed all three layers. Here's the before/after for each.

Fix 1: Errors are errors, not return values

Before:

async function analyzeItem(item) {
  try {
    const response = await groqClient.chat.completions.create({
      model: 'llama-3.3-70b-versatile',
      messages: [
        { role: 'system', content: ANALYSIS_SYSTEM_PROMPT },
        { role: 'user', content: item.content }
      ]
    });
    return response.choices[0].message.content;
  } catch (err) {
    return `Analysis failed: ${err.message}`;
  }
}

After:

async function analyzeItem(item) {
  const providers = [
    { name: 'groq', fn: () => analyzeWithGroq(item) },
    { name: 'ollama', fn: () => analyzeWithOllama(item) }
  ];

  for (const provider of providers) {
    try {
      const result = await provider.fn();
      if (!result || result.length < 100) {
        throw new Error(`${provider.name} returned suspiciously short result (${result?.length || 0} chars)`);
      }
      return { analysis: result, source: provider.name };
    } catch (err) {
      const isRetryable = err.status === 401 || err.status === 429 
        || err.status === 503 || err.code === 'ECONNREFUSED';
      if (isRetryable) {
        console.warn(`[${provider.name}] retryable failure: ${err.message}, trying next provider`);
        continue;
      }
      throw err; // Non-retryable errors propagate
    }
  }
  
  throw new Error(`All providers exhausted for item ${item.id}`);
}

The key changes: auth and availability errors trigger fallback to the next provider instead of returning an error string. Non-retryable errors throw — they never silently become data. And the function returns a structured object with the source provider, not a bare string.

Fix 2: Structural validation before persistence

const { analysis, source } = await analyzeItem(item);

// Validate structure before writing
if (typeof analysis !== 'string' || analysis.length < 100) {
  throw new Error(`Invalid analysis output: ${String(analysis).slice(0, 80)}`);
}
if (analysis.toLowerCase().includes('analysis failed:')) {
  throw new Error(`Error string leaked into analysis output: ${analysis.substring(0, 80)}`);
}

await writeResult(item.id, {
  analysis,
  analyzedAt: new Date().toISOString(),
  source
});

The sentinel check (analysis.toLowerCase().includes('analysis failed:')) is a belt-and-suspenders guard. The upstream fix should prevent error strings from reaching this point. But after watching a 40-character error string survive two days as "valid data," we don't trust any single layer anymore.

Fix 3: Provider health tracking with circuit breaking

let consecutiveFailures = 0;
let successCount = 0;
let failureCount = 0;
const FAILURE_THRESHOLD = 5;

for (const item of items) {
  try {
    const result = await analyzeItem(item);
    consecutiveFailures = 0;
    await writeResult(item.id, result);
    successCount++;
  } catch (err) {
    consecutiveFailures++;
    failureCount++;
    console.error(`[analyzer] failure ${consecutiveFailures}/${FAILURE_THRESHOLD}: ${err.message}`);
    
    if (consecutiveFailures >= FAILURE_THRESHOLD) {
      console.error(`[analyzer] ${FAILURE_THRESHOLD} consecutive failures — aborting run`);
      process.exit(1); // Let systemd mark the timer as failed
    }
  }
}

Five consecutive failures now kills the process. We'd rather have systemd mark the timer as failed — which shows up in our monitoring — than silently produce garbage for hours.

Fix 4: Reduced concurrency for local fallback

const CONCURRENCY = process.env.RESEARCH_ANALYZER_CONCURRENCY
  ? parseInt(process.env.RESEARCH_ANALYZER_CONCURRENCY, 10)
  : 3;

We set RESEARCH_ANALYZER_CONCURRENCY=1 in the service environment. Our VPS couldn't make reliable forward progress with parallel Ollama analysis — the local model would time out or produce truncated output under concurrent load. Sequential processing is slower, but every result is complete.
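The setting lives in the unit's environment. A sketch of the systemd drop-in, assuming a service named research-analyzer.service (the unit name and path are illustrative):

```ini
# /etc/systemd/system/research-analyzer.service.d/override.conf
# Created via: systemctl edit research-analyzer.service
[Service]
Environment=RESEARCH_ANALYZER_CONCURRENCY=1
```

After a `systemctl daemon-reload`, the next timer run picks up the sequential setting without a code deploy.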

Prevention

Beyond the code fixes, we made three systemic changes.

1. Error strings are now a monitored class

We added a nightly scan across all service output directories that greps for known error signatures:

const ERROR_SIGNATURES = [
  'analysis failed:',
  'invalid api key',
  'rate limit exceeded',
  'econnrefused',
  'timeout exceeded'
];

const fs = require('fs');
const path = require('path');

function auditOutputDirectory(dir) {
  const files = fs.readdirSync(dir).filter(f => f.endsWith('.json'));
  const poisoned = [];
  for (const file of files) {
    const data = JSON.parse(fs.readFileSync(path.join(dir, file), 'utf-8'));
    const text = JSON.stringify(data).toLowerCase();
    for (const sig of ERROR_SIGNATURES) {
      if (text.includes(sig)) {
        poisoned.push({ file, signature: sig });
      }
    }
  }
  return poisoned;
}

If any output file contains an error signature, the morning briefing flags it for review. This catches the class of bug, not just the specific instance.

2. API key expiration is now tracked

Every external API key in our system now has an expiresAt or rotateBy field in the control plane manifest. A daily check warns 7 days before expiration. The Groq key expired because nobody knew when it was due — it was provisioned once and forgotten.
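The daily check itself is small. A sketch assuming a manifest that lists keys with ISO-8601 expiresAt dates (the manifest shape and function name are illustrative):

```javascript
const WARN_DAYS = 7;

// Return the names of keys that expire within `warnDays` of `now`
// (or have already expired).
function keysNeedingRotation(manifest, now = new Date(), warnDays = WARN_DAYS) {
  const cutoff = new Date(now.getTime() + warnDays * 24 * 60 * 60 * 1000);
  return manifest
    .filter(k => k.expiresAt && new Date(k.expiresAt) <= cutoff)
    .map(k => k.name);
}

// Example manifest entry: { name: 'groq', expiresAt: '2026-04-08T03:00:00Z' }
```

The daily job feeds the returned names into the morning briefing, so an expiring key surfaces before it fails on a real request.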

3. Local model fallback is standard

Every service that calls an external AI provider now has a local Ollama fallback path. This isn't about cost — it's about availability. A local model producing slightly lower quality output is infinitely better than an error string masquerading as a result.

Lessons for Your Team

1. Never return error information through the success path. This is the cardinal sin. If your function returns string on success and string on failure, you have no type-level distinction between good and bad data. Use discriminated returns ({ ok: true, data } / { ok: false, error }), throw exceptions, or return null — anything that forces the caller to handle the failure case explicitly.

2. Validate output structure, not just input. We validated what went into the AI provider but not what came out. Input validation is table stakes. Output validation — checking that what you're about to persist actually looks like what it should — catches an entire class of silent corruption bugs.

3. Consecutive failure counts are your cheapest circuit breaker. You don't need a sophisticated circuit breaker library. A counter that increments on failure and resets on success, with a threshold that kills the process, would have caught this in under a minute instead of two days.

4. "No errors in the logs" doesn't mean "working correctly." Our service logged zero errors. It caught every exception and handled it — by turning it into data. The absence of errors was the error. If your monitoring only watches for error-level log lines, you're blind to this entire failure class.

5. Local model fallback is an availability pattern, not a cost pattern. We run Ollama not because it's cheaper than Groq — it's actually slower and lower quality for our use case. We run it because when the cloud provider is down, we still produce real output. The quality trade-off is worth it compared to producing nothing — or worse, producing error strings that look like something.

Conclusion

This incident was humbling because every individual design choice seemed reasonable in isolation. Catching exceptions? Good practice. Returning a string from a string-returning function? Type-consistent. Processing all items in the queue? That's the job. But composed together, these choices created a system that could silently poison its own data for days without any monitoring signal.

The fix wasn't complex — provider fallback, output validation, consecutive failure tracking. The lesson was: your error handling path is a data path too, and if you don't treat it with the same rigor as your happy path, errors will become data.

Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.

By Ledd Consulting