Detecting Behavioral Drift in Autonomous Agent Fleets With Rolling Z-Scores

When you have one autonomous agent, you watch it. When you have twenty-one, you need math.

At Ledd Consulting, we run a fleet of 21 autonomous agents across 35 services — handling everything from content publishing to lead qualification to marketplace operations. These agents run on schedules and event triggers, making real decisions with real consequences. For months, our monitoring was binary: is the service up, or is it down? That caught crashes. It did not catch the day our content agent silently started publishing 60% fewer posts, or the morning our job pipeline agent stopped submitting proposals entirely while still reporting healthy heartbeats.

The service was "up." The behavior had drifted. Nobody noticed for three days.

We built a drift detection system that catches these behavioral changes within 24 hours. Here's the exact pattern we extracted from production.

The Pattern — Rolling Z-Score Behavioral Drift Detection

One-sentence definition: Compute rolling statistical baselines from N days of agent behavioral metrics, then flag any metric whose current value deviates beyond a z-score threshold as behavioral drift.

This is not application performance monitoring. APM tells you latency is high. Drift detection tells you your agent is behaving differently than it has for the past two weeks — even when every health check is green.

The Naive Approach (and Why It Fails)

Most teams start with static thresholds. "Alert if proposals submitted drops below 5 per day." "Alert if latency exceeds 500ms." We did too.

Here's why static thresholds fail for autonomous agents:

1. Agent behavior is inherently variable. Our marketplace agents handle different volumes on weekdays versus weekends. A static threshold either fires false alarms every Saturday or misses real degradation on Tuesday.

2. Agents evolve. We ship prompt updates, adjust model tiers, add new capabilities. Last month's "normal" is not this month's "normal." Static thresholds require constant manual recalibration.

3. The failure mode is silence, not explosion. When a traditional service fails, requests error out. When an autonomous agent drifts, it keeps running — it just makes fewer decisions, generates less output, or shifts its action distribution. The health check passes. The behavior is wrong.

The second approach teams try is percentage-change alerting: "Alert if metric drops 30% from yesterday." This catches cliff-edge failures but misses gradual drift. An agent that degrades 5% per day for two weeks has lost half its output, and no single day triggered the 30% threshold.

Rolling z-scores solve both problems. They adapt to the agent's actual behavioral distribution and catch both sudden shifts and gradual drift.
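To make the gradual-drift failure concrete, here's a quick sketch with hypothetical numbers: a 5%-per-day decline never comes close to a 30% day-over-day alert, yet halves output in two weeks and blows past any sane z-score threshold against a flat baseline.

```javascript
// Simulate 14 days of 5%-per-day decline from a baseline of 100 actions/day
const series = Array.from({ length: 15 }, (_, i) => 100 * Math.pow(0.95, i));

// Percentage-change alerting: the largest day-over-day drop
const maxDailyDrop = Math.max(
  ...series.slice(1).map((v, i) => (series[i] - v) / series[i])
);
console.log(maxDailyDrop.toFixed(3)); // "0.050" — nowhere near a 30% threshold

// Cumulative loss after two weeks
console.log((1 - series[14] / series[0]).toFixed(2)); // "0.51" — over half the output is gone

// Z-score of day 14 against a flat baseline of 100 with a hypothetical stddev of 5
const z = Math.abs(series[14] - 100) / 5;
console.log(z.toFixed(1)); // "10.2" — far past any reasonable threshold
```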

Pattern Implementation

Here's the full pattern, extracted from our production drift detector. We'll walk through three stages: metric extraction, baseline computation, and drift detection.

Stage 1: Metric Extraction

First, we need to turn raw agent activity into numerical metrics. We pull from daily analytics snapshots — JSON files our analytics service writes at midnight:

const fs = require('fs');
const path = require('path');

// Directory where the analytics service writes daily snapshots (adjust to your deployment)
const ANALYTICS_DIR = process.env.ANALYTICS_DIR || './analytics';

// Parse a JSON file, returning null on I/O or parse errors
function readJSON(filePath) {
  try { return JSON.parse(fs.readFileSync(filePath, 'utf8')); }
  catch { return null; }
}

function getAnalyticsSnapshots(days = 14) {
  const snapshots = [];
  // Lexicographic sort is chronological, assuming ISO-dated names like daily-YYYY-MM-DD.json
  const files = fs.readdirSync(ANALYTICS_DIR)
    .filter(f => f.startsWith('daily-') && f.endsWith('.json'))
    .sort();
  for (const file of files.slice(-days)) {
    const data = readJSON(path.join(ANALYTICS_DIR, file));
    if (data) snapshots.push(data);
  }
  return snapshots;
}

The 14-day window is deliberate. Shorter windows (3–5 days) make the baseline too reactive — a bad week becomes the new "normal." Longer windows (30+ days) make it too slow to adapt to legitimate changes. Fourteen days gives us two full weekday/weekend cycles, which captures the natural rhythm of our agent fleet.

From each snapshot, we extract every behavioral metric we care about:

function extractDailyMetrics(snapshot) {
  const metrics = {
    date: snapshot.date || 'unknown',
    // Service health
    services_up: snapshot.service_health?.services_up || 0,
    services_total: snapshot.service_health?.services_total || 0,
    avg_latency_ms: snapshot.service_health?.avg_latency_ms || 0,
    // Marketplace
    marketplace_queries: snapshot.marketplace?.total_queries || 0,
    marketplace_revenue: snapshot.marketplace?.total_revenue_usd || 0,
    active_agents: snapshot.marketplace?.active_agents || 0,
    // Content
    ghost_posts: snapshot.ghost?.total_posts || 0,
    ghost_members: snapshot.ghost?.total_members || 0,
    // Social
    lens_posts: snapshot.social?.lens?.total_posts || 0,
    farcaster_casts: snapshot.social?.farcaster?.total_casts || 0,
    // Jobs
    total_proposals: snapshot.job_pipeline?.total_proposals || 0,
    proposals_pending: snapshot.job_pipeline?.pending || 0,
    proposals_submitted: snapshot.job_pipeline?.submitted || 0,
    // Automated workflows
    workflow_actions: snapshot.workflows?.total_actions || 0,
    knowledge_entries: snapshot.workflows?.knowledge_entries || 0,
  };
  return metrics;
}

Notice: we track behavioral outputs, not just system health. proposals_submitted, workflow_actions, ghost_posts — these are the metrics that tell you whether agents are doing their jobs, not just whether they're alive.

We also extract action-level metrics to detect shifts in what types of actions agents are taking:

function extractActionMetrics(actions, dateStr) {
  const dayActions = actions.filter(
    a => a.timestamp && a.timestamp.startsWith(dateStr)
  );
  const typeCounts = {};
  for (const a of dayActions) {
    const t = a.type || 'unknown';
    typeCounts[t] = (typeCounts[t] || 0) + 1;
  }
  return {
    total_actions: dayActions.length,
    action_types: typeCounts,
  };
}

This is critical. An agent that takes the same number of actions but shifts from submit_proposal to skip_proposal won't trip a volume-based alert. Action type distribution catches it.
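One way to turn an action-type shift into a single drift number is total variation distance between today's type distribution and the baseline's. This is a sketch, not our exact production code; the threshold you alert on and the example counts are illustrative.

```javascript
// Compare two action-type count maps as probability distributions.
// Returns total variation distance: 0 = identical, 1 = completely disjoint.
function actionDistributionDrift(baselineCounts, todayCounts) {
  const toDist = counts => {
    const total = Object.values(counts).reduce((a, b) => a + b, 0) || 1;
    const dist = {};
    for (const [type, n] of Object.entries(counts)) dist[type] = n / total;
    return dist;
  };
  const base = toDist(baselineCounts);
  const today = toDist(todayCounts);
  const types = new Set([...Object.keys(base), ...Object.keys(today)]);
  let tvd = 0;
  for (const t of types) tvd += Math.abs((base[t] || 0) - (today[t] || 0));
  return tvd / 2;
}

// Same volume (10 actions), but submit_proposal collapsed into skip_proposal:
const drift = actionDistributionDrift(
  { submit_proposal: 8, skip_proposal: 2 },
  { submit_proposal: 1, skip_proposal: 9 }
);
console.log(drift.toFixed(2)); // "0.70"
```

A volume-only alert sees 10 actions both days and stays silent; the distribution check flags a large shift.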

Stage 2: Baseline Computation

With 14 days of metrics, we compute a statistical baseline for every metric — mean, standard deviation, min, max:

function computeBaselines(snapshots) {
  if (snapshots.length < 2) return null;

  const metrics = snapshots.map(extractDailyMetrics);
  const keys = Object.keys(metrics[0]).filter(k => k !== 'date');
  const baselines = {};

  for (const key of keys) {
    const values = metrics.map(m => m[key]).filter(v => typeof v === 'number');
    if (values.length === 0) continue;
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const variance = values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length;
    const stddev = Math.sqrt(variance);
    baselines[key] = {
      mean: Math.round(mean * 100) / 100,
      stddev: Math.round(stddev * 100) / 100,
      min: Math.min(...values),
      max: Math.max(...values),
      samples: values.length,
    };
  }

  return baselines;
}

Two implementation details that matter:

We use the population standard deviation (dividing by N), not the sample standard deviation (dividing by N − 1). We're characterizing the full behavioral window, not estimating a population parameter from a sample. For a 14-day window, the difference is marginal, but it's a deliberate choice.

We require at least 2 snapshots. A single data point has zero variance — every z-score would be either zero or infinity. In practice, we don't trust baselines until we have at least 7 days of data. A fresh deployment gets a grace period.

Stage 3: Drift Detection

Now the core of the pattern — comparing today's metrics against the rolling baseline using z-scores:

function detectDrift(currentMetrics, baselines, threshold = 2.0) {
  const drifts = [];

  for (const [key, baseline] of Object.entries(baselines)) {
    const current = currentMetrics[key];
    if (typeof current !== 'number' || baseline.stddev === 0) continue;

    const zScore = Math.abs(current - baseline.mean) / baseline.stddev;
    if (zScore > threshold) {
      const direction = current > baseline.mean ? 'above' : 'below';
      drifts.push({
        metric: key,
        current,
        baseline_mean: baseline.mean,
        baseline_stddev: baseline.stddev,
        z_score: Math.round(zScore * 100) / 100,
        direction,
        severity: zScore > 3 ? 'high' : 'medium',
      });
    }
  }

  return drifts;
}

The threshold = 2.0 default means we flag anything more than 2 standard deviations from the mean. In a normal distribution, that's roughly 5% of observations. For agent behavior — which is not perfectly normal — we found this catches real drift without drowning us in noise.

The stddev === 0 guard is essential. If a metric has been perfectly constant for 14 days (e.g., services_total is always 35), any deviation produces an infinite z-score. We skip these metrics entirely — a constant metric changing is better caught by a simple equality check.
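That equality check for constant metrics can be as small as the following. This is a sketch of the idea rather than our exact production code; the severity choice is an assumption.

```javascript
// Companion check for metrics the z-score path skips: if a metric had zero
// variance across the whole window, any change at all is worth flagging.
function detectConstantMetricChange(currentMetrics, baselines) {
  const changes = [];
  for (const [key, baseline] of Object.entries(baselines)) {
    const current = currentMetrics[key];
    if (typeof current !== 'number' || baseline.stddev !== 0) continue;
    if (current !== baseline.mean) {
      changes.push({
        metric: key,
        current,
        expected: baseline.mean,
        severity: 'high', // assumption: a 14-day constant changing is almost never noise
      });
    }
  }
  return changes;
}

const changes = detectConstantMetricChange(
  { services_total: 34 },
  { services_total: { mean: 35, stddev: 0, min: 35, max: 35, samples: 14 } }
);
console.log(changes.length); // 1
```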

Severity tiering: z > 3 is high, z > 2 is medium. In production, high-severity drift fires an immediate alert to our notification service via the event bus. Medium-severity drift gets logged to drift history and included in the daily digest.

When drift is detected, we push it through our event bus so the rest of the system can react:

const http = require('http');

const EVENT_BUS_URL = 'http://127.0.0.1:5000/event';

// Inside drift detection handler:
if (drifts.length > 0) {
  const payload = JSON.stringify({
    type: 'drift.detected',
    source: 'drift-detector',
    data: { drifts, timestamp: new Date().toISOString() }
  });

  const req = http.request(EVENT_BUS_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }
  });
  // Don't let an unreachable event bus crash the detector
  req.on('error', err => console.error('event bus unreachable:', err.message));
  req.write(payload);
  req.end();
}

Our notification router picks up drift.detected events and dispatches them — high severity goes to Telegram immediately, medium gets batched into the morning digest.

In Production

We've been running this pattern across our full fleet for 11 weeks. Real numbers:

Fleet size: 21 agents across 35 services, 60+ scheduled timers, tracking 17 behavioral metrics per daily snapshot.

Detection latency: Average 18 hours from drift onset to alert. The system checks once per day against the prior day's metrics. We experimented with hourly checks but found the noise-to-signal ratio wasn't worth it — most behavioral drift is a daily-scale phenomenon.

Alert volume: ~3.2 medium-severity drift alerts per week, ~0.4 high-severity per week. Before tuning, we were at 8+ medium alerts per week — we reduced this by adding the stddev === 0 guard and excluding metrics that are structurally spiky (like marketplace_revenue, which can legitimately spike 5x on a single transaction).
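The exclusion itself is just a set lookup applied before drift detection. Something like the following sketch; the excluded list is ours and you should build your own from your alert history.

```javascript
// Filter baseline entries so structurally spiky metrics never reach the
// z-score path (the list here is illustrative)
const EXCLUDED_METRICS = new Set(['marketplace_revenue']);

function filterBaselines(baselines) {
  return Object.fromEntries(
    Object.entries(baselines).filter(([key]) => !EXCLUDED_METRICS.has(key))
  );
}

const filtered = filterBaselines({
  marketplace_revenue: { mean: 120, stddev: 300 },
  ghost_posts: { mean: 4.2, stddev: 1.1 },
});
console.log(Object.keys(filtered)); // [ 'ghost_posts' ]
```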

Catches we're proud of:

  • Our content publishing agent's output dropped from an average of 4.2 posts/day to 1.1 posts/day over a week. The z-score hit 2.8 on day three and 3.4 by day five. Root cause: a CMS API token had silently expired, and the agent was retrying and failing gracefully without errors.
  • Our job pipeline agent stopped submitting proposals entirely — proposals_submitted went to zero while proposals_pending stayed normal. Z-score: effectively infinity (caught by a separate zero-check we added after the stddev guard). Root cause: an OAuth integration had broken, and the agent was classifying every job as "not ready to submit."
  • Average fleet latency crept from 180ms to 340ms over 10 days. No single day triggered a static 500ms threshold. The z-score crossed 2.0 on day six at 285ms. Root cause: a memory leak in one of the observer services.

False positive we learned from: Weekend content volume is naturally lower. Early on, every Monday morning we'd get a drift alert because Sunday's metrics pulled down the rolling average. We solved this by ensuring the 14-day window always contains at least two full weeks — so weekday and weekend patterns are both represented in the baseline.

Variations

Weighted Recency

Our baseline treats all 14 days equally. For faster-evolving systems, apply exponential decay to weight recent days more heavily:

// Weighted mean with exponential decay
const alpha = 0.1; // decay factor
let weightedSum = 0, weightTotal = 0;
for (let i = 0; i < values.length; i++) {
  const weight = Math.exp(-alpha * (values.length - 1 - i));
  weightedSum += values[i] * weight;
  weightTotal += weight;
}
const weightedMean = weightedSum / weightTotal;

We haven't needed this yet — our 14-day uniform window works well for our fleet's pace of change. But if you're shipping prompt updates daily, a 7-day window with alpha = 0.15 will adapt faster.
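A weighted mean alone isn't enough for drift detection; the z-score also needs a standard deviation computed against the same weights. Here's a sketch under the same exponential-decay assumption, with an illustrative alpha:

```javascript
// Exponentially weighted mean and stddev over a window of daily values
// (oldest first). Recent days carry more weight as alpha grows.
function weightedStats(values, alpha = 0.1) {
  const weights = values.map((_, i) => Math.exp(-alpha * (values.length - 1 - i)));
  let weightedSum = 0, weightTotal = 0;
  values.forEach((v, i) => {
    weightedSum += v * weights[i];
    weightTotal += weights[i];
  });
  const mean = weightedSum / weightTotal;
  // Weighted population variance against the weighted mean
  const variance = values.reduce(
    (acc, v, i) => acc + weights[i] * (v - mean) ** 2, 0
  ) / weightTotal;
  return { mean, stddev: Math.sqrt(variance) };
}

// With alpha = 0 every day weighs the same, matching the uniform baseline
const { mean } = weightedStats([2, 4, 6], 0);
console.log(mean); // 4
```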

Per-Agent Baselines

We compute baselines across the entire fleet's daily snapshot. If your agents have fundamentally different behavioral profiles, compute baselines per agent. The tradeoff: you need more history per agent to get stable statistics, and you lose the ability to detect fleet-wide drift (e.g., a model provider degradation affecting all agents simultaneously).

Multi-Metric Correlation

A single metric drifting might be noise. Two correlated metrics drifting together is almost always real. We're experimenting with a simple heuristic: if 3+ metrics drift in the same direction on the same day, auto-escalate to high severity regardless of individual z-scores.
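The heuristic we're experimenting with can be sketched as a post-processing pass over the drift list; the minimum count and field names are ours, not a settled API:

```javascript
// If minCorrelated or more metrics drift in the same direction on the same
// day, escalate every drift in the batch to high severity.
function escalateCorrelatedDrift(drifts, minCorrelated = 3) {
  const byDirection = { above: 0, below: 0 };
  for (const d of drifts) byDirection[d.direction]++;
  const correlated =
    byDirection.above >= minCorrelated || byDirection.below >= minCorrelated;
  if (!correlated) return drifts;
  return drifts.map(d => ({ ...d, severity: 'high', escalated: true }));
}

const drifts = [
  { metric: 'ghost_posts', direction: 'below', severity: 'medium' },
  { metric: 'lens_posts', direction: 'below', severity: 'medium' },
  { metric: 'workflow_actions', direction: 'below', severity: 'medium' },
];
console.log(escalateCorrelatedDrift(drifts).every(d => d.severity === 'high')); // true
```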

Adaptive Thresholds

Instead of a fixed z > 2.0 threshold, set per-metric thresholds based on the metric's coefficient of variation (stddev/mean). High-variance metrics like marketplace_queries get a looser threshold (z > 2.5); low-variance metrics like services_up get a tighter one (z > 1.5). We hardcode our thresholds today, but this is the next iteration.
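As a sketch of what that next iteration might look like, here's a per-metric threshold derived from the coefficient of variation. The CV bands and threshold values are illustrative assumptions to tune against your own alert history:

```javascript
// Pick a z-score threshold per metric from its coefficient of variation.
// Bands here are illustrative, not production-tuned.
function adaptiveThreshold(baseline) {
  if (baseline.mean === 0) return 2.0; // CV undefined; fall back to the default
  const cv = baseline.stddev / Math.abs(baseline.mean);
  if (cv > 0.5) return 2.5;  // high-variance metrics: looser threshold
  if (cv < 0.1) return 1.5;  // near-constant metrics: tighter threshold
  return 2.0;
}

console.log(adaptiveThreshold({ mean: 200, stddev: 150 })); // 2.5 (cv = 0.75)
console.log(adaptiveThreshold({ mean: 35, stddev: 1 }));    // 1.5 (cv ≈ 0.03)
```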

Conclusion

Autonomous agents fail quietly. They don't throw exceptions when they drift — they just do less, do it differently, or do the wrong thing. Traditional monitoring answers "is it running?" Drift detection answers "is it behaving the way it was behaving?"

The rolling z-score pattern is simple enough to implement in a single file (ours is under 300 lines), cheap enough to run on any infrastructure (it reads JSON files and does arithmetic), and effective enough to catch behavioral changes that static thresholds and percentage-change alerts miss entirely.

If you take one thing from this post: track behavioral outputs, not just system health. proposals_submitted matters more than cpu_usage. posts_published matters more than response_time. The metrics that tell you whether your agents are doing their jobs are the ones worth computing baselines for.

Running autonomous agents without drift detection is like flying blind. We build monitoring that catches behavioral changes before they become incidents.

Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.

By Ledd Consulting