Keeping 60 Scheduled Timers Reliable on a Single VPS With Mission Control

At Ledd Consulting, we run 25 microservices, 7 autonomous agents, and over 60 systemd timers on a single VPS. Every one of those timers matters — they scrape job boards, post to social channels, run nightly research pipelines, generate client reports, and reconcile billing data. When a timer silently dies at 2 AM, the downstream consequences ripple for hours before anyone notices.

This post walks through the supervision layer we built to keep that entire fleet healthy: staleness detection, automatic restart with rate limiting, state-change alerting, and a centralized mission control dashboard that gives us a single pane of glass over everything.

The Pain Point

If you run scheduled automation in production, you already know the failure mode. A cron job stops firing. Maybe the process was OOM-killed. Maybe a dependency timed out and the exit code swallowed the error. Maybe systemd marked the timer as inactive (dead) after a single failure and moved on.

The insidious part: everything looks fine. The timer unit exists. systemctl list-timers shows it. The service file is intact. But the last trigger was 47 hours ago, and the downstream system is serving stale data to your clients.

We hit this exact scenario when our social posting timers went silent for three days. The posts were still queued — the timer just stopped triggering the service. Downstream analytics showed zero engagement and we assumed it was a content problem. It was an infrastructure problem, invisible because systemd timers fail quietly.

Why Common Solutions Fall Short

Cron + log tailing. The classic approach: set up cron jobs and tail the logs. This works fine for five jobs. At 60, you're grepping through a dozen log files, hoping the timestamp format is consistent, and manually computing "has this run recently enough?" The cognitive load scales linearly with the timer count.

External monitoring services. Healthchecks.io, Cronitor, and similar tools are excellent — for jobs that actively phone home. They require each job to curl a check-in URL on completion. That means modifying every single script, handling the case where the curl itself fails, and paying for 60+ individual monitors. For a single-VPS setup, the overhead-to-value ratio is unfavorable.

Systemd's built-in OnFailure=. Systemd lets you trigger a failure handler unit. We use this selectively, but it only fires when the service fails — it tells you nothing about a timer that simply stopped scheduling. A timer that quietly enters inactive state triggers zero OnFailure handlers.

What we needed was a layer that sits above all the timers, continuously asks "when did each one last fire?", and takes corrective action when the answer is "too long ago."

Our Approach

We built three interlocking components:

  1. System Health Monitor — a Node.js script that runs every 15 minutes (itself a systemd timer, which is wonderfully recursive), queries every timer's last trigger time, compares it against a per-timer staleness threshold, and auto-restarts anything that's gone stale.
  2. Uptime Monitor — a complementary service that checks HTTP endpoints, systemd unit states, and raw port connectivity every 10 minutes, alerting only on state transitions (up→down or down→up) to avoid alert fatigue.
  3. Mission Control — a centralized API that aggregates health data from both monitors, the observer dashboard, and the agent fleet into a single queryable endpoint with stale-while-revalidate caching.

The key design decision: each timer gets its own staleness budget. A job that runs every 6 hours tolerates 400 minutes of silence. A nightly pipeline tolerates 1,500 minutes. A bid-drafting job that runs every 4 hours gets a tighter 300-minute leash. This eliminates false positives from timers with legitimately long intervals.

Implementation

The Timer Registry

Every monitored resource lives in a single registry object with its type, staleness threshold, and criticality flag:

const MONITORED_SERVICES = {
  // Always-running daemons (health-checked via HTTP or systemctl)
  'gateway.service':        { type: 'daemon', healthUrl: null, critical: true },
  'nginx.service':          { type: 'daemon', critical: true },
  'notification-service.service': { type: 'daemon', healthUrl: 'http://127.0.0.1:8080/health', critical: true },
  'event-trigger.service':  { type: 'daemon', healthUrl: 'http://127.0.0.1:4000/health', critical: true },

  // Timer-based services (staleness-checked)
  'morning-briefing.timer':     { type: 'timer', maxStaleMinutes: 1500, critical: false },
  'job-scraper.timer':          { type: 'timer', maxStaleMinutes: 400,  critical: false },
  'social-post.timer':          { type: 'timer', maxStaleMinutes: 400,  critical: false },
  'nightly-pipeline.timer':     { type: 'timer', maxStaleMinutes: 1500, critical: false },
  'bid-drafter.timer':          { type: 'timer', maxStaleMinutes: 300,  critical: false },
  'evening-reflection.timer':   { type: 'timer', maxStaleMinutes: 1500, critical: false },
  'client-nurture.timer':       { type: 'timer', maxStaleMinutes: 1500, critical: false },
};

This registry is the single source of truth. Adding a new timer to supervision means adding one line. The critical flag controls whether a failure triggers an email escalation or just gets logged and auto-fixed.

Staleness Detection

The core insight: systemd already tracks when each timer last fired. We just need to read it:

const { execSync } = require('child_process');

function getTimerLastTrigger(timer) {
  try {
    const output = execSync(
      `systemctl show ${timer} --property=LastTriggerUSec 2>/dev/null`,
      { encoding: 'utf8' }
    ).trim();
    const match = output.match(/LastTriggerUSec=(.+)/);
    if (match && match[1] !== 'n/a' && match[1] !== '') {
      const date = new Date(match[1]);
      if (!isNaN(date.getTime())) return date;
    }
  } catch { /* systemctl unavailable or unit unknown — treat as never triggered */ }
  return null;
}

function minutesSince(date) {
  if (!date) return Infinity;
  return Math.floor((Date.now() - date.getTime()) / 60000);
}

Returning Infinity when the date is null is deliberate — a timer that has never triggered is the stalest possible timer and should always exceed its threshold. This catches the subtle case where a timer unit was installed but the initial trigger was missed.
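Putting the two helpers together, the per-cycle check reduces to a walk over the registry. This is a sketch rather than the production code; injecting the lookup function is my own framing, used here so the logic can be exercised without a live systemd:

```javascript
// Sketch: collect every registry timer that has exceeded its staleness budget.
// `getLastTrigger` and `minutesSinceFn` are injected so the check stays testable.
function findStaleTimers(registry, getLastTrigger, minutesSinceFn) {
  const stale = [];
  for (const [name, cfg] of Object.entries(registry)) {
    if (cfg.type !== 'timer') continue; // daemons are health-checked elsewhere
    const ageMinutes = minutesSinceFn(getLastTrigger(name));
    if (ageMinutes > cfg.maxStaleMinutes) {
      stale.push({
        service: name,
        issue: ageMinutes === Infinity
          ? 'never triggered'
          : `stale (${ageMinutes} min since last run)`,
        critical: cfg.critical,
      });
    }
  }
  return stale;
}
```

Everything the restart and escalation logic needs comes out of this one pass: the service name, a human-readable issue string, and the criticality flag.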

Auto-Restart With Rate Limiting

When a timer exceeds its staleness budget, the monitor restarts it. But blind restarts are dangerous — a service that's crash-looping will thrash the system if you restart it every 15 minutes forever. We cap restarts at three per hour per service:

// Attempt restart (max 3 times per hour)
const attempts = state.restartAttempts[service] || { count: 0, lastAttempt: 0 };
const hourAgo = Date.now() - 3600000;

if (attempts.lastAttempt < hourAgo) {
  attempts.count = 0; // Reset counter after an hour
}

if (attempts.count < 3) {
  log(`Attempting restart of ${service} (attempt ${attempts.count + 1}/3)`);
  const ok = restartService(service);
  attempts.count++;
  attempts.lastAttempt = Date.now();
  state.restartAttempts[service] = attempts;

  if (ok) {
    fixed.push(service);
    healthy++;
  }
} else {
  log(`${service} has failed 3 restart attempts this hour — escalating`);
}

The state is persisted to disk between runs, so the rate limit survives process restarts:

{
  "restartAttempts": {
    "gateway.service": { "count": 1, "lastAttempt": 1773561602551 },
    "notification-service.service": { "count": 3, "lastAttempt": 1773103501814 }
  },
  "lastFullCheck": "2026-03-16T10:00:01.727Z",
  "issues": [
    { "service": "job-scraper.timer", "issue": "never triggered", "critical": false },
    { "service": "nightly-pipeline.timer", "issue": "stale (3120 min since last run)", "critical": false }
  ],
  "lastSummaryAt": "2026-03-16T09:30:00.769Z"
}

That's real production state. The notification service hit its 3-restart cap and escalated. The job scraper timer shows "never triggered" — it was recently reinstalled and awaiting its first cycle.

State-Change Alerting (Eliminating Alert Fatigue)

The uptime monitor complements the health monitor by tracking transitions rather than absolute states. It alerts when something goes down and again when it recovers — never on steady-state:

for (const result of results) {
  const prev = state.services[result.id] || {};
  const wasDown = prev.status === 'down';
  const isDown = result.status === 'down';

  if (!wasDown && isDown) {
    // Transition: UP → DOWN
    log(`STATE CHANGE: ${result.name} went DOWN (${result.error})`);
    alertDown(result);
    state.services[result.id] = { ...result, last_down: now, last_check: now };
    stateChanged = true;
  } else if (wasDown && !isDown) {
    // Transition: DOWN → UP
    log(`STATE CHANGE: ${result.name} RECOVERED`);
    alertRecovered(result, prev.last_down);
    state.services[result.id] = { ...result, last_check: now };
    stateChanged = true;
  } else {
    // Steady state — record the result and update the timestamp only
    state.services[result.id] = { ...prev, ...result, last_check: now };
  }
}

This pattern means a service that's been down for six hours generates exactly two emails: one when it went down, one when it recovered. Compare that with naive polling that would send 36 alerts for the same incident.

The Heartbeat Fallback

Here's the meta-problem: what watches the watcher? Our primary orchestrator runs on a 10-minute heartbeat cycle, writing a rotation file on each pass. A separate bash script — itself a systemd timer — checks whether that rotation file is fresh:

# Check if heartbeat file was updated in the last 15 minutes
if [ -f "$HEARTBEAT_FILE" ]; then
  FILE_AGE=$(( $(date +%s) - $(stat -c %Y "$HEARTBEAT_FILE") ))
  if [ "$FILE_AGE" -lt 900 ]; then
    # Heartbeat is fresh — built-in loop is working, skip
    exit 0
  fi
fi

# Stale heartbeat — wake the orchestrator
curl -s -X POST http://127.0.0.1:5000/hooks/wake \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HOOK_TOKEN" \
  -d "{\"text\":\"HEARTBEAT FALLBACK: rotation file is ${FILE_AGE}s stale. Run your full cycle NOW.\"}" \
  > /dev/null 2>&1

Simple, brutal, effective. File age check via stat, 900-second threshold, direct wake call if stale. The entire fallback is 15 lines of bash.
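For reference, the unit pair that schedules such a fallback script might look like the following. The unit names, path, and 5-minute cadence are illustrative, not the production values:

```ini
# /etc/systemd/system/heartbeat-fallback.service (hypothetical names)
[Unit]
Description=Wake the orchestrator if its heartbeat file goes stale

[Service]
Type=oneshot
ExecStart=/usr/local/bin/heartbeat-fallback.sh

# /etc/systemd/system/heartbeat-fallback.timer
[Unit]
Description=Run the heartbeat fallback check every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
```

Persistent=true makes systemd fire a missed run immediately after boot, which matters here: the outermost safety net is the one component that must not silently skip a cycle.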

Mission Control Aggregation

The mission control service pulls everything together into a single API, using stale-while-revalidate caching to stay fast even when downstream sources are slow:

const cache = {};
function cached(key, ttlMs, fetchFn) {
  const entry = cache[key];
  const now = Date.now();
  if (entry && now - entry.ts < ttlMs) return entry.data;
  // Stale-while-revalidate: return stale data, refresh in background
  if (entry) {
    fetchFn().then(d => { cache[key] = { data: d, ts: Date.now() }; }).catch(() => {});
    return entry.data;
  }
  // First call — block until data arrives
  return fetchFn().then(d => { cache[key] = { data: d, ts: Date.now() }; return d; });
}

A single /health call to mission control returns the aggregated state of every timer, every daemon, every agent, and the overall operational effectiveness score — all within 30ms because the cache serves stale data while background-refreshing from the observer.
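Because cached() returns plain data on a warm hit but a promise on a cold start, callers normalize with Promise.resolve. A self-contained sketch of that usage — the helper is repeated from above, and fetchHealth is a hypothetical stand-in for the real observer query:

```javascript
// Minimal cached() as above, repeated so this sketch is self-contained
const cache = {};
function cached(key, ttlMs, fetchFn) {
  const entry = cache[key];
  const now = Date.now();
  if (entry && now - entry.ts < ttlMs) return entry.data;
  if (entry) {
    fetchFn().then(d => { cache[key] = { data: d, ts: Date.now() }; }).catch(() => {});
    return entry.data;
  }
  return fetchFn().then(d => { cache[key] = { data: d, ts: Date.now() }; return d; });
}

// Hypothetical aggregator standing in for the real observer query
async function fetchHealth() {
  return { status: 'ok', services: '18/18', checkedAt: new Date().toISOString() };
}

// Cold call returns a promise; warm call returns the cached object directly.
// Promise.resolve() handles both uniformly inside an HTTP handler:
async function healthHandler(req, res) {
  const data = await Promise.resolve(cached('health', 30000, fetchHealth));
  res.setHeader('Content-Type', 'application/json');
  res.end(JSON.stringify(data));
}
```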

Event-Triggered Recovery

The final piece: when the health monitor detects a critical failure that survives three restart attempts, it fires an event to the event trigger service, which wakes the orchestrator with a specific remediation instruction:

// Fire event for orchestrator-level remediation
await postSignedJson('127.0.0.1', 4000, '/trigger', {
  type: 'system.service_down',
  data: {
    serviceName: criticalIssues.map(i => i.service).join(', '),
    services: criticalIssues.map(i => i.service),
    issues: criticalIssues
  }
}, { source: 'system-health-monitor' });

The event trigger maps this to a specific instruction: attempt a restart, run diagnostics, and email the team if the restart fails. This closes the loop — the system escalates from automatic restart → event-triggered recovery → human notification, each tier activating only when the previous one is exhausted.
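The receiving side of that mapping can be sketched as a table keyed by event type that renders the wake instruction. All names here are illustrative, since the post doesn't show the trigger service's internals:

```javascript
// Hypothetical event → instruction table inside the event trigger service
const EVENT_INSTRUCTIONS = {
  'system.service_down': (data) =>
    `CRITICAL: ${data.serviceName} is down after 3 restart attempts. ` +
    `Attempt one supervised restart, run diagnostics, and email the team if it still fails.`,
};

function buildWakeInstruction(event) {
  const render = EVENT_INSTRUCTIONS[event.type];
  return render ? render(event.data) : null; // unknown event types are ignored
}
```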

Results

Since deploying this supervision layer:

  • 18 of 18 monitored services checked every 15 minutes, 96 times per day
  • State-change alerting reduced alert volume by ~95% compared to per-check notifications
  • Auto-restart resolves the majority of transient failures (process OOM, dependency timeout) before any human sees an alert
  • 3-per-hour rate limiting prevents restart thrashing — we've seen exactly one crash-loop escalation in production, which correctly triggered the email path
  • 6-hour periodic summaries via the notification router give a pulse check even when everything is healthy ("18/18 OK, all clear")
  • Heartbeat fallback has triggered 4 times in production, each time successfully recovering the orchestrator within seconds

The entire supervision layer is pure Node.js and bash — stdlib http, fs, and child_process (execSync), nothing else. Total code across the three monitors: ~500 lines.

Adapting This for Your System

The pattern generalizes to any timer-based automation fleet:

  1. Build a registry. Every timer gets an entry with its expected cadence and a staleness threshold set to 2–3x the normal interval. This eliminates false positives while still catching genuine failures.
  2. Query the scheduler, avoid instrumenting each job. Systemd's LastTriggerUSec property gives you staleness data for free. If you're on cron, check syslog timestamps or write a one-line completion marker file.
  3. Rate-limit your restarts. Three attempts per hour with a persisted counter is a solid default. After exhaustion, escalate — automated restarts of a fundamentally broken service just mask the problem.
  4. Alert on transitions, log on steady state. Your on-call engineer needs to know when something changes. They already know it's still broken.
  5. Layer your fallbacks. A timer that watches other timers can itself fail. Add a trivial heartbeat file check (file age via stat) as the outermost safety net. Keep it dead simple — the fallback for your fallback should be 15 lines of bash, maximum.
  6. Cache aggressively for dashboards. Mission control serves stale data while revalidating in the background. Your dashboard should always load instantly, even if the underlying data is 30 seconds old.
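For point 2 on cron, the marker-file variant can be this small. Paths and budget are illustrative; each cron entry appends `&& touch <marker>` so the marker's mtime records the last successful run:

```shell
# Each cron job touches a marker on success, e.g.:
#   0 */6 * * * /usr/local/bin/scrape-jobs.sh && touch /tmp/markers/scrape-jobs
# The supervisor then compares marker age against the staleness budget.
check_marker() {
  marker="$1"
  budget_seconds="$2"
  if [ ! -f "$marker" ]; then
    echo "STALE: $marker has never been written"
    return 1
  fi
  age=$(( $(date +%s) - $(stat -c %Y "$marker") ))
  if [ "$age" -gt "$budget_seconds" ]; then
    echo "STALE: $marker is ${age}s old (budget ${budget_seconds}s)"
    return 1
  fi
  echo "OK: $marker is ${age}s old"
}

# Demo: simulate a fresh run, then check against a 400-minute budget
touch /tmp/scrape-jobs.marker
check_marker /tmp/scrape-jobs.marker $((400 * 60))
```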

Conclusion

Scheduled automation at scale demands active supervision. Timers fail silently, and the gap between "the timer exists" and "the timer is actually firing on schedule" is where production incidents hide. A lightweight supervision layer — staleness budgets, rate-limited auto-restart, state-change alerting, and a heartbeat fallback — keeps a fleet of 60+ timers healthy on a single VPS with minimal operational overhead.

The entire system is ~500 lines of Node.js, runs on the same box it monitors, and has caught every silent timer failure we've had in production. The meta-lesson: infrastructure that watches itself is worth more than infrastructure that merely runs.

Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.

By Ledd Consulting