How a Privilege Boundary Turned a Single 401 Into 205 Consecutive Failures
At Ledd Consulting, we run 25 microservices and 7 AI agents on a single VPS. We built a session reliability watcher that detects agent failures, applies exponential backoff, and auto-restarts degraded services. On April 3rd, a single expired authentication token on one agent exposed a blind spot we hadn't considered: our self-healing system ran as an unprivileged user, and the only fix for this failure type — restarting the auth sync service — required root. The system faithfully detected the problem every 15 minutes for over three days, grew its diagnostic timeouts to their 60-second ceiling, activated fallback mode, filed 205 identical alerts, and never once fixed the issue.
Summary
Our Gemini-backed research agent ("hermes-gemini") began returning 401 auth errors on the evening of April 3rd. Our session reliability watcher detected the failure within its first check cycle and attempted remediation. The restart action failed because the watcher process runs as a non-root user and cannot issue systemctl restart against the auth sync service. The system escalated to exponential backoff and fallback mode, but the root cause — a stale credential — persisted. Over the next 72+ hours, the watcher logged 205 consecutive failure records at 15-minute intervals before a human operator intervened. Downstream impact was limited because fallback mode routed work to alternative providers, but the incident revealed that our auto-remediation had a privilege gap that turned a 30-second fix into a multi-day degradation.
Timeline
April 3, 23:25 UTC — Session reliability watcher detects hermes-gemini as unhealthy. consecutiveFailures: 2. Alert level: normal. Backoff timer set to 3,000ms.
April 3, 23:29 UTC — Failure count hits 3, exceeding maxRetries. Fallback mode activates. The watcher attempts restart_service for gemini-auth-sync.service — immediately fails with "system restart requires privileged operator". Backoff timer grows to 10,000ms.
April 3, 23:50 UTC — consecutiveFailures: 5. Alert level escalates from normal to high (threshold: alertOnConsecutiveFailures: 5). Timeout increment kicks in: next diagnostic timeout grows by 5,000ms per cycle.
April 4, 01:20 UTC — consecutiveFailures: 11. The diagnostic timeout has hit the 60,000ms ceiling. Every 15-minute check now logs an identical record: restart attempted=false, retry backoff=10,000ms, next timeout=60,000ms. The system is in a stable failure loop.
April 4, 14:05 UTC — consecutiveFailures: 62. System has been logging identical failure records for 15 hours. Fallback providers (Claude, Codex) absorb hermes-gemini's workload silently.
April 6, ~01:50 UTC — consecutiveFailures: 205. An operator finally notices the accumulated alerts during a morning review and runs sudo systemctl restart gemini-auth-sync.service. The fix takes under 30 seconds.
April 6, ~02:05 UTC — Next watcher cycle detects hermes-gemini as healthy. consecutiveFailures resets to 0, fallbackActive clears, all backoff timers reset.
Root Cause
The root cause was straightforward: an expired OAuth credential on the Gemini auth sync service. The interesting failure wasn't the credential — those expire all the time. The interesting failure was in our remediation architecture.
Our session reliability watcher runs as a systemd timer under a non-root user. Here's the privilege check that blocked the fix:
```javascript
async function restartService(serviceName, dryRun) {
  if (!serviceName) return { attempted: false, success: false, reason: 'no service configured' };
  if (dryRun) {
    return { attempted: true, success: true, reason: 'dry-run restart skipped' };
  }
  // Privilege gate: bail out if the process isn't running as root
  if (typeof process.getuid === 'function' && process.getuid() !== 0) {
    return {
      attempted: false,
      success: false,
      reason: `system restart requires privileged operator for ${serviceName}`,
      requiresPrivilege: true,
    };
  }
  const result = await execCommand(`systemctl restart ${serviceName}`, 15000);
  return {
    attempted: true,
    success: result.healthy,
    reason: result.healthy ? 'service restarted' : (result.output || result.error || 'restart failed'),
  };
}
```
That process.getuid() !== 0 check returns the same result every single cycle. The remediation action itself was configured correctly:
```json
{
  "autoRemediationActions": [
    {
      "trigger": "401_auth_error",
      "action": "restart_service",
      "service": "gemini-auth-sync.service",
      "maxAutoRemediations": 3,
      "alertAfterAction": true
    },
    {
      "trigger": "consecutive_timeout",
      "action": "increase_timeout_and_retry",
      "timeoutIncrementMs": 5000,
      "maxTimeoutMs": 60000
    }
  ]
}
```
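The routing implied by that config can be sketched in a few lines. This is an illustrative reconstruction, not the actual watcher code; routeRemediation and the field names are assumptions based on the config shape above.

```javascript
// Illustrative sketch: map a classified failure type to the configured
// remediation action, or null if no action is configured for it.
function routeRemediation(failureType, config) {
  return config.autoRemediationActions.find((a) => a.trigger === failureType) || null;
}
```

Given the config above, routeRemediation('401_auth_error', config) returns the restart_service entry, which is exactly what the watcher attempted every cycle.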
The system knew the right fix was a service restart. It knew it couldn't perform that fix. And then it did the only thing it was authorized to do: increase backoff timers and activate fallback mode. Every 15 minutes for 72 hours.
The cascading behavior looked like this:
- Failure detection worked perfectly — diagnostics ran systemctl is-active gemini-auth-sync.service and correctly identified the auth error.
- Failure classification worked perfectly — detectFailureTypes() matched 401 and auth patterns to auth_error.
- Remediation routing worked perfectly — the config mapped 401_auth_error to restart_service.
- Remediation execution hit a wall — an unprivileged process can't restart a systemd service.
- Backoff escalation worked perfectly — exponential backoff climbed from 1s → 3s → 10s, and timeouts grew by 5s per cycle to a 60s ceiling.
- Fallback activation worked perfectly — after 3 consecutive failures, the system routed hermes-gemini work to alternative providers.
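The backoff and timeout growth described above can be sketched as two small functions. These are reconstructions from the timeline, not the watcher's actual code; the 15-second base timeout in particular is an assumption, while the increment and ceiling come from the config.

```javascript
// Retry backoff steps through 1s -> 3s -> 10s, then holds at 10s.
const BACKOFF_STEPS_MS = [1000, 3000, 10000];

function nextBackoffMs(consecutiveFailures) {
  const idx = Math.min(Math.max(consecutiveFailures, 1), BACKOFF_STEPS_MS.length) - 1;
  return BACKOFF_STEPS_MS[idx];
}

// Diagnostic timeout grows linearly by timeoutIncrementMs per failed cycle,
// clamped at maxTimeoutMs. The 15s base is an assumed starting value.
function nextTimeoutMs(consecutiveFailures, baseMs = 15000, incrementMs = 5000, maxTimeoutMs = 60000) {
  return Math.min(baseMs + incrementMs * Math.max(consecutiveFailures - 1, 0), maxTimeoutMs);
}
```

By failure 11 the timeout is pinned at 60,000ms and the backoff at 10,000ms, matching the identical log records the watcher emitted from that point on.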
Every subsystem did exactly what it was designed to do. The architecture failed because no subsystem was designed to handle "I know the fix, but I can't execute it."
The Fix
Immediate: Sudoers Entry for Targeted Restarts
We gave the watcher process the ability to restart specific services — not blanket root access, but scoped sudoers entries for the exact services it monitors:
```
# /etc/sudoers.d/session-reliability
appuser ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart gemini-auth-sync.service
appuser ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart task-processor.service
```
The restartService function now uses the sudo path that our system health monitor already had:
```javascript
async function restartService(serviceName, dryRun) {
  if (!serviceName) return { attempted: false, success: false, reason: 'no service configured' };
  if (dryRun) {
    return { attempted: true, success: true, reason: 'dry-run restart skipped' };
  }
  // Use sudo with the restricted sudoers entry instead of requiring root
  const result = await execCommand(
    `sudo /usr/bin/systemctl restart ${serviceName}`, 15000
  );
  return {
    attempted: true,
    success: result.healthy,
    reason: result.healthy ? 'service restarted' : (result.output || result.error || 'restart failed'),
  };
}
```
This mirrors what our system health monitor already does successfully:
```javascript
function restartService(service) {
  const svcName = service.replace('.timer', '.service');
  try {
    execSync(`sudo /usr/bin/systemctl restart ${svcName}`, { timeout: 30000 });
    log(`Restarted ${svcName}`);
    return true;
  } catch (err) {
    log(`Failed to restart ${svcName}: ${err.message}`);
    return false;
  }
}
```
We had already solved this exact problem in the health monitor. The session reliability watcher was built later by a different agent session and didn't inherit the pattern.
Structural: Escalation-Aware Remediation
The deeper fix was adding an escalation path. When auto-remediation fails due to privilege constraints, the system now has three escalation tiers instead of one:
```javascript
// After: escalation-aware remediation
if (restart.requiresPrivilege && !restart.success) {
  // Tier 1: Try sudo path (newly available via sudoers)
  const sudoRestart = await execCommand(
    `sudo /usr/bin/systemctl restart ${serviceName}`, 15000
  );
  if (sudoRestart.healthy) {
    return { attempted: true, success: true, reason: 'service restarted via sudo' };
  }

  // Tier 2: Delegate to system health monitor (runs every 15 min, has sudo)
  await postSignedJson('127.0.0.1', 8080, '/trigger', {
    type: 'system.service_down',
    data: { serviceName, source: 'session-reliability-watcher' }
  }, { source: 'session-reliability' });

  // Tier 3: Emit high-priority operator alert with exact fix command
  return {
    attempted: false,
    success: false,
    reason: `privilege escalation failed — operator action required: sudo systemctl restart ${serviceName}`,
    requiresPrivilege: true,
    escalated: true,
  };
}
```
Prevention
1. Privilege Audit Across All Auto-Remediation Paths
We audited every remediation action across all 25 services and asked one question: "Can the process that detects this failure also fix it?" We found three more cases where the answer was no — timer restarts, log rotation triggers, and certificate renewal. All now have scoped sudoers entries.
2. Staleness Detection on Consecutive Failure Runs
We added a circuit breaker that detects when the watcher is stuck in a stable failure loop — same failure type, same remediation result, no progress:
```javascript
// If the last 5 remediation results are identical, escalate instead of repeating
const lastFive = state.lastRemediation.slice(-5);
const lastFiveIdentical = lastFive.length >= 5
  && new Set(lastFive.map((r) => r.reason)).size === 1;

if (lastFiveIdentical) {
  await sendNotification({
    source: 'session-reliability',
    priority: 'critical',
    category: 'system',
    title: `Stuck remediation loop: ${agentName}`,
    message: `${state.consecutiveFailures} consecutive failures with identical remediation result: "${lastFive[0]?.reason}". Manual intervention likely required.`
  });
}
```
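For this check to work, the watcher has to keep a bounded history of remediation results. A minimal sketch of that bookkeeping, with recordRemediation and the maxHistory size as illustrative names rather than our actual implementation:

```javascript
// Append each remediation result and keep only the most recent entries,
// so the "last N identical" check stays cheap and bounded.
function recordRemediation(state, result, maxHistory = 10) {
  state.lastRemediation = [...(state.lastRemediation || []), result].slice(-maxHistory);
  return state;
}
```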
3. Cross-System Health Monitor Integration
Our system health monitor already ran every 15 minutes with restart capabilities for critical services. But it monitored systemd service status, not agent session health. The session reliability watcher monitored agent session health, not systemd service status. Neither talked to the other.
Now the session reliability watcher fires a system.service_down event through our signed IPC layer when it identifies a service that needs restarting:
```javascript
// postSignedJson signs the request via createSignedHeaders internally,
// keyed by the source identity.
await postSignedJson('127.0.0.1', 8080, '/trigger', {
  type: 'system.service_down',
  data: {
    serviceName: action.service,
    services: [action.service],
    issues: [{ service: action.service, issue: 'auth_error', critical: true }]
  }
}, { source: 'session-reliability' });
```
The health monitor picks up this event on its next cycle and performs the restart with its existing sudo capability.
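The receiving side is straightforward. The handler below is a simplified sketch of what the health monitor might do with the event, with handleTriggerEvent and restartQueue as assumed names: it validates the event type and queues the named services for restart on the next cycle.

```javascript
// Sketch of the health monitor's event handler: queue services named in a
// system.service_down event for restart on the next monitoring cycle.
function handleTriggerEvent(event, restartQueue) {
  if (!event || event.type !== 'system.service_down') return false;
  const services = event.data.services || [event.data.serviceName];
  for (const svc of services) {
    if (svc && !restartQueue.includes(svc)) restartQueue.push(svc);
  }
  return true;
}
```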
Lessons for Your Team
1. Test remediation paths end-to-end, not just detection paths. We had comprehensive tests for failure detection and classification. Zero tests for whether remediation actions could actually execute. A simple integration test that triggered a restart in staging would have caught the privilege gap immediately.
2. Self-healing systems need a "stuck" detector. Exponential backoff is great for transient failures. For persistent failures, it just makes your logs bigger. If the same remediation result appears 5 times in a row with no improvement, that's not backoff — that's denial. Escalate.
3. Two monitors watching the same system from different angles won't talk to each other by default. Our system health monitor watched service status. Our session reliability watcher watched agent behavior. Both detected the same underlying problem. Neither could fix it alone, and neither knew the other existed. If you have overlapping observability systems, wire them together — at minimum, let them fire events that the other can consume.
4. The privilege boundary you set for security becomes the ceiling for your automation. We deliberately ran the session reliability watcher as a non-root process. That was the right security decision. But we didn't complete the thought: if this process can't restart services, who can, and how does it ask? Least-privilege is only half the design. The other half is the escalation path.
5. Fallback mode can mask urgency. Because our fallback providers absorbed the workload silently, the system kept functioning at reduced capacity. The 205 consecutive failure alerts felt like noise, not a fire. If fallback mode is working, your incentive to fix the underlying issue drops. Consider time-bounding fallback — if it's been active for N hours, escalate regardless.
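The time-bounded fallback idea from lesson 5 is a one-function change. This is a sketch under assumed names (fallbackActive, fallbackSince, and the six-hour default are illustrative, not our production values):

```javascript
// Escalate if fallback has been active past a hard time bound, regardless
// of whether the fallback providers are keeping up with the workload.
function shouldEscalateFallback(state, now = Date.now(), maxFallbackMs = 6 * 60 * 60 * 1000) {
  return Boolean(state.fallbackActive)
    && typeof state.fallbackSince === 'number'
    && now - state.fallbackSince >= maxFallbackMs;
}
```

With a bound like this in place, our 72-hour fallback would have produced a hard escalation within the first six hours instead of 205 quiet alerts.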
Conclusion
The irony of this incident is that everything worked. Detection, classification, backoff, fallback — all functioning exactly as designed. The system correctly identified a 401 auth error, correctly determined the fix was a service restart, correctly determined it couldn't perform that restart, and then correctly logged that conclusion 205 times over three days.
Self-healing systems are only as powerful as the actions they're authorized to take. If your detection is excellent but your remediation hits a privilege wall, you haven't built self-healing — you've built self-diagnosing. The gap between "I know what's wrong" and "I can fix it" is where multi-day degradations hide.
Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.