An Uptime Monitor That Costs Nothing and Covers 26 Services

At Ledd Consulting, we run 26 services on a single VPS. Seven AI agents, a marketplace API, an analytics pipeline, a newsletter platform, an MCP server, a notification router, and a fleet of scheduled timers that fire throughout the day. When we started looking at uptime monitoring, every SaaS option we evaluated charged per-endpoint — and the math got ugly fast.

The Problem

We needed monitoring across three distinct layers: public HTTP endpoints, internal systemd daemons, and raw TCP port checks. Most uptime services handle the first category fine. Few handle the second. None handle the third without bolting on a separate agent binary that phones home to their cloud.

Here's what the pricing looked like for our stack:

  • Better Uptime: $29/month for 50 monitors. Covers HTTP only.
  • Datadog Synthetics: $7.20 per 10,000 test runs. At 26 services checked every 10 minutes, that's ~112,000 runs/month — roughly $80.
  • UptimeRobot: Free tier caps at 50 monitors with 5-minute intervals, but no systemd or port awareness.

None of these could tell us "your IPC bus process is active on systemd but its health endpoint returns 503." That's the failure mode that actually bites us — a zombie service that's technically running but functionally dead. We needed three-layer verification: is the process alive, is the port open, and does the endpoint respond correctly?

So we built our own. Total cost: zero. Total lines of code: about 300 across two scripts.

Architecture Overview

The monitoring stack has two components that run on offset schedules:

┌─────────────────────────────────────────────────┐
│              Uptime Monitor (every 10 min)       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │   HTTP   │  │ systemd  │  │   TCP    │      │
│  │  Checks  │  │ Checks   │  │  Ports   │      │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘      │
│       └─────────────┼─────────────┘            │
│            State Comparison                       │
│       (previous run vs. current)                  │
│            ┌─────┴──────┐                        │
│            │ Changed?   │──No──▶ silent           │
│            └─────┬──────┘                        │
│              Yes │                                │
│         ┌────────┴────────┐                      │
│         │  Alert + Write  │                      │
│         │  status.json    │                      │
│         └─────────────────┘                      │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│         Health Monitor (every 15 min)            │
│  ┌──────────────┐  ┌────────────────────┐       │
│  │ Daemon Health │  │ Timer Staleness    │       │
│  │ + HTTP /health│  │ (maxStaleMinutes)  │       │
│  └──────┬───────┘  └────────┬───────────┘       │
│         └────────┬───────────┘                   │
│           Auto-Restart (max 3/hr)                │
│           Escalation if restart fails            │
└─────────────────────────────────────────────────┘

The uptime monitor checks whether things are reachable. The health monitor checks whether things are functioning. Together, they cover 26 services across three check types, plus staleness detection on 11 scheduled timers.

Implementation Walkthrough

Three-Layer Health Checks

The core insight is that a single check type isn't enough. We define services across three categories and run all checks concurrently:

const WEB_ENDPOINTS = [
  { id: 'consulting-site', name: 'Consulting Site', url: 'https://app.example.com/' },
  { id: 'newsletter', name: 'Newsletter Platform', url: 'https://intel.example.com/' },
  { id: 'marketplace-api', name: 'Marketplace API', url: 'https://marketplace.example.com/health' },
];

const SYSTEMD_SERVICES = [
  { id: 'svc-newsletter', name: 'Newsletter (systemd)', unit: 'ghost-newsletter' },
  { id: 'svc-marketplace', name: 'Marketplace (systemd)', unit: 'marketplace-api' },
  { id: 'svc-ipc', name: 'IPC Bus (systemd)', unit: 'notification-service' },
  { id: 'svc-watcher', name: 'Watcher (systemd)', unit: 'task-processor' },
  { id: 'svc-tunnel', name: 'Tunnel (systemd)', unit: 'cloudflared-tunnel' },
  { id: 'svc-analytics', name: 'Analytics API (systemd)', unit: 'api-analytics' },
];

const PORT_CHECKS = [
  { id: 'port-newsletter', name: 'Newsletter (port 8080)', port: 8080 },
  { id: 'port-marketplace', name: 'Marketplace (port 3000)', port: 3000 },
  { id: 'port-mcp', name: 'MCP Server (port 4000)', port: 4000 },
  { id: 'port-ipc', name: 'IPC Bus (port 5000)', port: 5000 },
  { id: 'port-watcher', name: 'Watcher (port 5001)', port: 5001 },
  { id: 'port-analytics', name: 'Analytics (port 5002)', port: 5002 },
];

Each check type uses a different mechanism. HTTP checks use Node's built-in http/https modules with a 15-second timeout — we bumped this from 5 seconds after our newsletter platform (Ghost) occasionally took 8-12 seconds to respond under memory pressure:

const http = require('http');
const https = require('https');

const HTTP_TIMEOUT = 15000; // 15 seconds; see the Ghost note above

function checkHttpEndpoint(endpoint) {
  return new Promise((resolve) => {
    const startTime = Date.now();
    const urlObj = new URL(endpoint.url);
    const client = urlObj.protocol === 'https:' ? https : http;

    const req = client.get(endpoint.url, { timeout: HTTP_TIMEOUT }, (res) => {
      res.resume();
      const elapsed = Date.now() - startTime;
      if (res.statusCode >= 200 && res.statusCode < 400) {
        resolve({ id: endpoint.id, name: endpoint.name, type: 'http',
                  status: 'up', responseTime: elapsed, statusCode: res.statusCode });
      } else {
        resolve({ id: endpoint.id, name: endpoint.name, type: 'http',
                  status: 'down', responseTime: elapsed,
                  error: `HTTP ${res.statusCode}` });
      }
    });

    req.on('timeout', () => { req.destroy();
      resolve({ id: endpoint.id, name: endpoint.name, type: 'http',
                status: 'down', responseTime: HTTP_TIMEOUT, error: 'Timeout' });
    });

    req.on('error', (err) => {
      resolve({ id: endpoint.id, name: endpoint.name, type: 'http',
                status: 'down', responseTime: Date.now() - startTime, error: err.message });
    });
  });
}

Note the pattern: every check resolves, never rejects. This is deliberate. We use Promise.all to run all 26 checks concurrently, and a single rejection would kill the entire batch. A down service is a data point, not an exception.
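The pattern is easiest to see in isolation. Here is a minimal, self-contained sketch; fakeCheck is a stand-in for the real checkers above, and runAllChecks is our label, not necessarily the script's:

```javascript
// Self-contained sketch of the always-resolve pattern: each checker maps
// failure into data ({ status: 'down', error }) instead of rejecting.
function fakeCheck(id, shouldFail) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(
      shouldFail ? { id, status: 'down', error: 'simulated failure' }
                 : { id, status: 'up' }
    ), 10);
  });
}

async function runAllChecks(defs) {
  // Promise.all is safe here precisely because no checker ever rejects:
  // one down service cannot abort the rest of the batch.
  return Promise.all(defs.map((d) => fakeCheck(d.id, d.fail)));
}

runAllChecks([{ id: 'a' }, { id: 'b', fail: true }, { id: 'c' }])
  .then((results) => {
    const down = results.filter((r) => r.status === 'down').map((r) => r.id);
    console.log(down); // down is ['b'] here
  });
```

Swap fakeCheck for the real checkers and the shape of runAllChecks stays the same.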

Systemd checks shell out to systemctl is-active:

const { execSync } = require('child_process');

function checkSystemdService(service) {
  return new Promise((resolve) => {
    try {
      const result = execSync(`systemctl is-active ${service.unit}`,
        { timeout: 5000, encoding: 'utf8' }).trim();
      resolve({ id: service.id, name: service.name, type: 'systemd',
                status: result === 'active' ? 'up' : 'down',
                error: result !== 'active' ? `State: ${result}` : null });
    } catch (err) {
      const output = (err.stdout || '').trim();
      resolve({ id: service.id, name: service.name, type: 'systemd',
                status: 'down', error: `State: ${output || 'unknown'}` });
    }
  });
}

Port checks use raw TCP sockets with a 3-second timeout — intentionally shorter than the HTTP timeout because if a port isn't accepting connections within 3 seconds on localhost, something is genuinely wrong:

const net = require('net');

function checkPort(portDef) {
  return new Promise((resolve) => {
    const socket = new net.Socket();
    socket.setTimeout(3000);
    socket.on('connect', () => { socket.destroy();
      resolve({ id: portDef.id, name: portDef.name, type: 'port', status: 'up' });
    });
    socket.on('timeout', () => { socket.destroy();
      resolve({ id: portDef.id, name: portDef.name, type: 'port',
                status: 'down', error: 'Connection timeout' });
    });
    socket.on('error', (err) => {
      resolve({ id: portDef.id, name: portDef.name, type: 'port',
                status: 'down', error: err.message });
    });
    socket.connect(portDef.port, '127.0.0.1');
  });
}

State-Change Alerting (The Anti-Spam Pattern)

This is the feature that makes the monitor livable. Nobody wants 144 "all clear" emails per day. We persist state to disk between runs and only alert on transitions:

const fs = require('fs');

// STATE_FILE is the on-disk path where results persist between runs.
function loadState() {
  try {
    return JSON.parse(fs.readFileSync(STATE_FILE, 'utf8'));
  } catch {
    return { services: {} };
  }
}

function saveState(state) {
  fs.writeFileSync(STATE_FILE, JSON.stringify(state, null, 2));
}

After each check cycle, we compare the current status of every service against the previous run's persisted state. An alert fires only when a service transitions from up to down or from down to up. In practice, this means we receive 2-5 emails per week instead of the hundreds that naive interval-based alerting produces.
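The comparison itself reduces to a small pure function. diffState is our name for a hypothetical helper, and treating a first sighting as silent is a policy choice on our part, not necessarily what the production script does:

```javascript
// Hypothetical diff helper: compares this run's results against the
// persisted state and returns only the transitions worth alerting on.
function diffState(prevState, results) {
  const transitions = [];
  for (const r of results) {
    const prev = prevState.services[r.id];
    // First sighting of a service: record it silently rather than alert.
    if (prev && prev.status !== r.status) {
      transitions.push({ id: r.id, from: prev.status, to: r.status });
    }
    prevState.services[r.id] = { status: r.status, checkedAt: Date.now() };
  }
  return transitions;
}
```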

The status file also gets written to a web-accessible path as status.json, giving us a public status page with zero additional infrastructure — just a static HTML page that fetches and renders the JSON.
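For the page itself, a pure renderer is enough. This hypothetical helper (the exact shape of status.json is an assumption) turns the parsed file into markup that a tiny inline script can inject after fetching /status.json:

```javascript
// Hypothetical renderer for the static status page: pure function from
// the parsed status.json to an HTML fragment. We assume status.json
// carries a `services` array of { name, status } objects.
function renderStatus(status) {
  const rows = status.services.map((s) =>
    `<li class="${s.status}">${s.name}: ${s.status.toUpperCase()}</li>`
  ).join('\n');
  return `<ul>\n${rows}\n</ul>`;
}
```

Keeping the renderer pure means the browser script is just fetch, parse, and an innerHTML assignment.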

Self-Healing With Escalation Caps

The second script — the system health monitor — doesn't just detect problems. It fixes them. But with guardrails:

if (attempts.count < 3) {
  log(`Attempting restart of ${service} (attempt ${attempts.count + 1}/3)`);
  const ok = restartService(service);
  attempts.count++;
  attempts.lastAttempt = Date.now();
  state.restartAttempts[key] = attempts;

  if (ok) {
    fixed.push(service);
    healthy++;
  }
} else {
  log(`${service} has failed 3 restart attempts this hour — escalating`);
}

The cap is 3 restart attempts per service per hour. After that, the monitor stops trying and escalates to email. This prevents a crash-loop scenario where the monitor restarts a service that immediately dies, over and over, potentially masking a deeper issue or burning through system resources.

The restart counter resets after one hour, so transient failures that resolve themselves don't permanently block auto-recovery:

const hourAgo = Date.now() - 3600000;
if (attempts.lastAttempt < hourAgo) {
  attempts.count = 0;
}
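Pulled together, the cap-and-reset policy fits in one pure, hypothetical helper, which makes it easy to unit-test in isolation:

```javascript
const WINDOW_MS = 3600000; // one hour
const MAX_ATTEMPTS = 3;

// Hypothetical guard combining the cap and the hourly reset: failures
// older than the window are forgotten so auto-recovery isn't blocked.
function shouldAttemptRestart(attempts, now) {
  if (now - attempts.lastAttempt >= WINDOW_MS) attempts.count = 0;
  return attempts.count < MAX_ATTEMPTS;
}
```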

Timer Staleness Detection

Beyond daemons and endpoints, we monitor 11 scheduled timers — jobs that should run on predictable intervals. The health monitor tracks when each timer last fired and compares against a configured maximum staleness:

const MONITORED_SERVICES = {
  'morning-briefing.timer': { type: 'timer', maxStaleMinutes: 1500, critical: false },
  'job-scraper.timer': { type: 'timer', maxStaleMinutes: 400, critical: false },
  'bid-drafter.timer': { type: 'timer', maxStaleMinutes: 300, critical: false },
  'client-nurture.timer': { type: 'timer', maxStaleMinutes: 1500, critical: false },
  // ... 11 timers total
};

The check uses systemd's LastTriggerUSec property:

const { execSync } = require('child_process');

function getTimerLastTrigger(timer) {
  try {
    const output = execSync(
      `systemctl show ${timer} --property=LastTriggerUSec 2>/dev/null`,
      { encoding: 'utf8' }
    ).trim();
    const match = output.match(/LastTriggerUSec=(.+)/);
    if (match && match[1] !== 'n/a' && match[1] !== '') {
      const date = new Date(match[1]);
      if (!isNaN(date.getTime())) return date;
    }
  } catch {}
  return null;
}

A maxStaleMinutes of 1500 (25 hours) for daily timers gives us a one-hour grace window. If the morning briefing hasn't fired in 25 hours, something is wrong. The bid drafter at 300 minutes (5 hours) gets flagged faster because it runs more frequently and we need those bids submitted on time.
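The comparison against maxStaleMinutes is then a one-liner. This hypothetical helper treats a missing trigger timestamp as stale, which is an assumption on our part; a timer that has never fired is indistinguishable from one that is broken:

```javascript
// Staleness check, assuming getTimerLastTrigger (above) has already
// produced a Date or null.
function isTimerStale(lastTrigger, maxStaleMinutes, now = Date.now()) {
  if (!lastTrigger) return true; // never fired (or unreadable) counts as stale
  const staleMinutes = (now - lastTrigger.getTime()) / 60000;
  return staleMinutes > maxStaleMinutes;
}
```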

What Surprised Us

The 5-second timeout was a lie. Our initial HTTP timeout was 5 seconds, which seemed generous for checking our own services. Then Ghost (our newsletter platform) started throwing false positives during its periodic content indexing cycle, where response times would spike to 8-12 seconds. We bumped the timeout to 15 seconds and the false positives vanished. The lesson: even on localhost-adjacent infrastructure, your services have background workloads that affect response times in ways you don't expect.

State file corruption was a real failure mode. During one incident, the monitor process was killed mid-write, leaving a truncated JSON file. The next run couldn't parse state, treated everything as "new," and fired alerts for all 26 services simultaneously at 3 AM. The try/catch that falls back to { services: {} } was added after that incident. We considered using write-then-rename for atomic writes but decided the simpler fallback was sufficient — a single burst of false alerts is annoying but not dangerous.

Three restart attempts per hour is the right number. We started with 5. The problem: a service with a configuration error would get restarted 5 times in rapid succession, each time consuming resources during startup before crashing again. At 3 attempts, we catch transient failures (which almost always recover on the first try) without hammering the system on persistent failures.

Lessons Learned

State-change alerting is non-negotiable. Every monitoring system we've ever abandoned, we abandoned because of alert fatigue. The state-file pattern costs 10 lines of code and reduces alert volume by 95%+.

Always resolve, never reject. When you're running health checks concurrently with Promise.all, a single rejection kills the batch. Model check results as data (status: 'up' or 'down'), not as success/failure of the check itself. This is a small API design choice that prevents an entire class of bugs.

Separate detection from healing. Our two-script architecture — one for detection and alerting, another for health checks and auto-restart — means each can run on its own schedule and fail independently. The uptime monitor runs every 10 minutes; the health monitor every 15. If the health monitor crashes, detection still works. If detection misses a cycle, auto-healing still runs.
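The offset schedules come from plain systemd timers. A hypothetical unit for the uptime monitor is shown below; names and paths are assumptions, and the health monitor gets an equivalent pair with a 15-minute OnCalendar expression such as *:03/15 to stagger the two:

```ini
# /etc/systemd/system/uptime-monitor.timer (hypothetical name)
[Unit]
Description=Run uptime monitor every 10 minutes

[Timer]
OnCalendar=*:00/10
Persistent=true

[Install]
WantedBy=timers.target
```

The matching .service unit is a oneshot that runs the Node script; Persistent=true replays a missed run after a reboot.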

Staleness is a better signal than status for scheduled work. A timer can show active in systemd while its actual job hasn't fired in 36 hours due to a dependency failure. Tracking LastTriggerUSec against expected intervals catches failures that process-level monitoring misses entirely.

Zero dependencies means zero dependency failures. The entire monitor uses Node.js built-ins: http, https, net, fs, child_process. No axios, no node-fetch, no monitoring libraries. When your monitoring tool has a dependency chain, you've created a new failure mode for the system that's supposed to catch failure modes.

Conclusion

We've been running this setup for months. It monitors 26 services across three check types, auto-heals transient failures, detects stale timers, and alerts only on state changes. The total resource cost is negligible — two Node.js scripts that each run for under 2 seconds, every 10 and 15 minutes respectively. The total dollar cost is zero.

The SaaS monitoring industry has convinced teams that uptime monitoring requires a dedicated platform. For most single-host or small-cluster deployments, 300 lines of Node.js and a systemd timer will outperform any external service — because it can check things external services simply cannot reach.

Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.

By Ledd Consulting