When Our Cost Tracker Caught a $900 Overnight API Spend — and How We Auto-Capped It
At Ledd Consulting, we run 25 microservices and 7 autonomous agents on a single VPS. Our daily API bill is typically under $2. One Tuesday morning, it was $900. Here is exactly what happened, how we caught it, and the auto-capping system we built to make sure it can't happen again.
Summary
A misconfigured retry loop in our research pipeline caused one agent to hammer Claude Opus — the most expensive model tier — for six straight hours overnight. Our daily cost report, which ran once at 7 AM UTC, emailed the damage after $912.47 had already been spent. Total time from detection to fix: 94 minutes. Total time spent building the prevention layer that now protects every service: two days. The core lesson: a daily cost email is an autopsy report, and what you actually need is a circuit breaker.
Timeline
01:00 UTC — Scheduled research pipeline kicks off. Six parallel workers begin processing queued tasks. Under normal conditions, each worker runs 4 API calls using Haiku (our cheapest tier), totaling roughly $0.07 per run.
01:02 UTC — Worker 3 encounters a malformed response from an upstream data source. The retry handler, designed to re-attempt on transient failures, begins looping. A recent config change had bumped maxRetries from 3 to Infinity for a different failure mode — and the catch block was too broad.
01:02–07:00 UTC — Worker 3 retries continuously. Each retry fans out to the model router, which — because the task now carries an elevated priority flag from the retry escalation logic — routes to Opus instead of Haiku. At $15/million input tokens and $75/million output tokens, every call costs nearly 19x what it should.
07:00 UTC — Our daily cost tracker runs on schedule and emails the report:
ACTUAL COST: $912.47 (research pipeline only)
07:12 UTC — Alert acknowledged. We SSH into the VPS, identify the runaway process via journalctl, and kill it.
07:34 UTC — Root cause confirmed: the retry loop combined with priority escalation routing the calls to the wrong model tier.
08:46 UTC — Hotfix deployed: hard retry cap restored, catch block narrowed, and — critically — the first version of our spend circuit breaker goes live.
Root Cause
Two independently reasonable decisions collided.
Decision 1: Generous retries for transient failures. We had bumped maxRetries to a high value for network timeouts, because our VPS occasionally sees brief connectivity blips. The intent was sound — retry a dropped connection a few extra times before failing a task that took 30 minutes to queue.
Decision 2: Priority escalation on retry. Our model router uses a tiering system (we wrote about this previously). When a task retries more than twice, the router interprets this as "the cheaper model might lack the capability" and escalates to the next tier. Haiku → Sonnet → Opus. Three retries meant every subsequent call hit Opus pricing.
The combination: an infinite retry loop that escalated to the most expensive model on the third attempt, then kept calling it forever.
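The interaction is easier to see in code. The sketch below is a simplified reconstruction, not our production router; `pickModel`, `runTask`, and the exact escalation thresholds are illustrative.

```javascript
const TIERS = ['claude-haiku-4-5', 'claude-sonnet-4-6', 'claude-opus-4-6'];

// Decision 2: the router escalates one tier after more than two retries.
function pickModel(attempt) {
  // attempts 0-2 -> Haiku, attempt 3 -> Sonnet, attempt 4+ -> Opus
  const tier = Math.min(Math.max(attempt - 2, 0), TIERS.length - 1);
  return TIERS[tier];
}

// Decision 1: an unbounded retry loop behind a catch block that is too
// broad to tell a permanent failure from a transient one.
async function runTask(task, maxRetries = Infinity) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task.execute(pickModel(attempt));
    } catch (e) {
      // A permanently malformed upstream response looks exactly like a
      // transient network blip here, so the loop never exits.
    }
  }
}
```

Each piece is defensible on its own; composed, every attempt past the third runs on the most expensive tier forever.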
Our cost tracker at the time was a daily batch job. Here is what it looked like — a straightforward aggregator that tallied usage once per day and emailed a summary:
// Pricing per million tokens
const PRICING = {
  'claude-haiku-4-5': { input: 0.80, output: 4.00 },
  'claude-sonnet-4-6': { input: 3.00, output: 15.00 },
  'claude-opus-4-6': { input: 15.00, output: 75.00 },
  'gpt-4': { input: 30.00, output: 60.00 },
  'ollama': { input: 0, output: 0 }
};

function generateDailyReport() {
  const sections = [
    getPipelineCosts(),
    getResearchCosts(),
    getChatCosts(),
    getEmailCosts()
  ];
  let totalCost = 0;
  const breakdown = [];
  for (const section of sections) {
    if (!section) continue;
    totalCost += section.cost;
    breakdown.push(section);
  }
  // Log to JSONL
  const logEntry = {
    date: DATE,
    totalCost: totalCost.toFixed(4),
    breakdown
  };
  fs.appendFileSync(COST_LOG, JSON.stringify(logEntry) + '\n');
  return { totalCost, breakdown };
}
This code is correct. It does exactly what it says: generate a daily report. The problem is that "daily" means "after the money is already gone." A report that runs at 7 AM telling you about a spike at 1 AM is six hours of unmonitored burn.
The Fix
We rebuilt cost tracking in three layers: real-time tallying, rolling spend windows, and automatic circuit-breaking.
Layer 1: Per-Call Cost Attribution
Every API call now writes a cost entry immediately, attributed to the calling service:
function recordApiCost(serviceName, model, usage) {
  const pricing = PRICING[model];
  if (!pricing) return;
  const inputCost = (usage.input / 1_000_000) * pricing.input;
  const outputCost = (usage.output / 1_000_000) * pricing.output;
  const totalCost = inputCost + outputCost;
  const entry = {
    timestamp: Date.now(),
    service: serviceName,
    model,
    tokens: {
      input: usage.input,
      output: usage.output,
      cacheRead: usage.cacheRead || 0,
      cacheWrite: usage.cacheWrite || 0
    },
    cost: totalCost
  };
  fs.appendFileSync(COST_LOG, JSON.stringify(entry) + '\n');
  return totalCost;
}
This is the foundational change. Instead of estimating costs from end-of-day aggregates, we record the actual token counts from every API response the moment it arrives. The JSONL append is intentionally synchronous — we accept the ~1ms I/O penalty because cost data integrity matters more than shaving latency off a call that already took 2–8 seconds.
Layer 2: Rolling Spend Windows
We aggregate costs over sliding windows — 1 hour, 6 hours, and 24 hours — per service:
function getRollingSpend(serviceName, windowMs) {
  const cutoff = Date.now() - windowMs;
  const lines = fs.readFileSync(COST_LOG, 'utf8').split('\n');
  let total = 0;
  for (const line of lines) {
    if (!line.trim()) continue;
    try {
      const entry = JSON.parse(line);
      if (entry.timestamp < cutoff) continue;
      if (serviceName && entry.service !== serviceName) continue;
      total += entry.cost;
    } catch (e) {
      // skip malformed lines
    }
  }
  return total;
}

const WINDOWS = {
  hourly: { ms: 60 * 60 * 1000, limit: 5.00 },
  sixHour: { ms: 6 * 60 * 60 * 1000, limit: 15.00 },
  daily: { ms: 24 * 60 * 60 * 1000, limit: 50.00 }
};
Those limits — $5/hour, $15/six-hours, $50/day — reflect our actual usage patterns. On a normal day, our 25 services collectively spend under $2. A $5 hourly spend means something is already 60x above baseline. We derived these thresholds by analyzing 30 days of JSONL cost logs after backfilling them from our session transcripts.
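For reference, this is roughly how a baseline can be extracted from those logs (a sketch; `maxHourlySpend` and `suggestLimit` are names made up for this post, not production code):

```javascript
// Bucket per-call cost entries by hour and find the worst normal hour.
// Entries have the same { timestamp, cost } shape as the JSONL records.
function maxHourlySpend(entries) {
  const buckets = new Map();
  for (const { timestamp, cost } of entries) {
    const hour = Math.floor(timestamp / 3_600_000); // 1-hour buckets
    buckets.set(hour, (buckets.get(hour) || 0) + cost);
  }
  return Math.max(0, ...buckets.values());
}

// Set the limit comfortably above the worst hour observed.
function suggestLimit(entries, headroom = 10) {
  return maxHourlySpend(entries) * headroom;
}
```

The headroom multiplier keeps the limit far enough above normal traffic that it only trips on genuine anomalies, not busy afternoons.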
Layer 3: Automatic Circuit Breaking
Every API call now passes through a spend gate before executing:
function checkSpendGate(serviceName) {
  for (const [window, config] of Object.entries(WINDOWS)) {
    const spent = getRollingSpend(serviceName, config.ms);
    if (spent >= config.limit) {
      // Fire alert
      sendAlert({
        level: 'critical',
        service: serviceName,
        window,
        spent: spent.toFixed(4),
        limit: config.limit,
        action: 'circuit_break'
      });
      return {
        allowed: false,
        reason: `${window} spend $${spent.toFixed(2)} exceeds $${config.limit} limit`
      };
    }
  }
  return { allowed: true };
}

// In the API call wrapper:
async function callModel(serviceName, model, messages) {
  const gate = checkSpendGate(serviceName);
  if (!gate.allowed) {
    throw new SpendLimitError(gate.reason);
  }
  const response = await anthropic.messages.create({ model, messages });
  recordApiCost(serviceName, model, {
    input: response.usage.input_tokens,
    output: response.usage.output_tokens,
    cacheRead: response.usage.cache_read_input_tokens || 0,
    cacheWrite: response.usage.cache_creation_input_tokens || 0
  });
  return response;
}
When the circuit breaks, the service gets a SpendLimitError. Each service's error handler decides what to do — most queue the task for retry after a cooldown, some fall back to the local Ollama model (zero cost), and a few simply skip the task and log it for manual review.
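A minimal sketch of that handler pattern (`SpendLimitError` is redeclared so the snippet stands alone; the `fallback` parameter stands in for whatever your local-model client looks like):

```javascript
class SpendLimitError extends Error {}

// Wrap a model call so an open circuit degrades instead of failing.
// callModel and fallback are injected so the policy stays per-service.
async function runWithFallback(callModel, fallback, messages) {
  try {
    return await callModel(messages);
  } catch (e) {
    if (e instanceof SpendLimitError) {
      return fallback(messages); // e.g. the zero-cost local Ollama model
    }
    throw e; // unrelated errors still propagate
  }
}
```

Keeping the fallback decision in the service, not the gate, means the circuit breaker stays a dumb on/off switch and each service can pick the degradation mode that suits its workload.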
The Daily Report Still Runs
We kept the daily email. It serves a different purpose now — a human-readable summary rather than the primary detection mechanism:
function main() {
  log('Calculating daily usage...');
  const { dateStr, stats, actualCost } = getTodayUsage();
  const totalTokens = stats.input + stats.output
    + stats.cacheRead + stats.cacheWrite;
  const lines = [
    `Daily Usage Report`,
    `${dateStr}`,
    `${'='.repeat(40)}`,
    ``,
    `SUBSCRIPTION SERVICES`,
    `  Covered by flat monthly fee`,
    `  Messages: ${stats.messages}`,
    `  Input tokens: ${stats.input.toLocaleString()}`,
    `  Output tokens: ${stats.output.toLocaleString()}`,
    `  Cache read: ${stats.cacheRead.toLocaleString()}`,
    `  Cache write: ${stats.cacheWrite.toLocaleString()}`,
    `  Total tokens: ${totalTokens.toLocaleString()}`,
    `  Cost: $0.00 (subscription)`,
  ];
  if (actualCost > 0) {
    lines.push(`ACTUAL COST: $${actualCost.toFixed(4)} (API only)`);
  } else {
    lines.push(`ACTUAL COST: $0.00`);
  }
  sendEmail(`Daily Usage Report - ${dateStr}`, lines.join('\n'));
}
The daily email now includes a "circuit breaker events" section showing every time the gate tripped in the past 24 hours. On quiet days, that section is empty. On the day we deployed this, it would have tripped at 01:08 UTC — six minutes into the incident instead of six hours.
Prevention
Beyond the circuit breaker, we made three structural changes:
1. Retry budgets replace retry counts. Each service gets a dollar-denominated retry budget per task. A Haiku retry costs fractions of a cent, so the budget allows dozens of attempts. An Opus retry costs nearly 19x more, so the same budget allows only a few. This naturally prevents cost escalation through retries.
2. Model tier is locked at task creation. The router assigns a model tier when the task enters the queue. Retries stay on the same tier. Escalation requires explicit human approval or a separate escalation task with its own budget.
3. JSONL rotation. The cost log rotates daily. The getRollingSpend function reads at most two files (today and yesterday) to cover any 24-hour window. On our system with ~200 API calls per day, each file is roughly 40KB — trivial to scan synchronously.
Lessons for Your Team
Daily cost reports are post-mortems, and they always arrive too late. If your cost monitoring runs on a cron schedule measured in hours, it functions as an accounting tool — useful for budgeting, useless for incident prevention. Real-time per-call recording with rolling aggregation windows is the minimum viable approach.
Per-service attribution matters more than aggregate totals. A $50 daily spend across 25 services is normal. A $50 daily spend from one service is a five-alarm fire. The circuit breaker evaluates each service independently, so a healthy service remains unaffected when a sibling trips its limit.
Retry logic and cost logic must be aware of each other. The most dangerous cost spikes we have seen come from retry storms — a tight loop hitting an expensive endpoint repeatedly. Denominating retry budgets in dollars instead of attempt counts makes cost a first-class constraint in your retry policy.
File-based cost tracking works fine at our scale. We considered Redis, SQLite, and Supabase for cost aggregation. A single JSONL file with synchronous appends handles 200+ entries per day with zero operational complexity. The circuit breaker reads the full daily file (~40KB) on every API call and completes in under 2ms. Optimize when you need to — we haven't needed to yet.
The $900 was the cost of the lesson. The prevention layer cost two days to build. The ROI math on cost observability is straightforward: it pays for itself the first time it prevents a single runaway loop.
Conclusion
Our $912.47 overnight surprise was entirely preventable. The daily cost tracker we had built was solid engineering — accurate aggregation, clean per-service breakdown, reliable email delivery. It simply ran too late to matter during an incident. The three-layer system we replaced it with — real-time recording, rolling spend windows, and automatic circuit breaking — has been live for four months. It has tripped the gate seven times, each time killing a potential runaway within minutes instead of hours. Total cost of those seven incidents combined: $31.42.
The pattern generalizes to any team running AI services in production: if your cost monitoring is batch-oriented, you are always learning about problems after they have already become expensive. Real-time spend gates with per-service attribution turn cost management from a monthly spreadsheet exercise into an operational safety layer.
Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.