The Single-Manifest Control Plane: How One JSON File Governs 26 Services, 19 Automations, and 30+ Processes

The Pattern — Catalog-Driven Infrastructure

One JSON file declares every service, health check, automation, and process in your infrastructure. Every other artifact — documentation, health dashboards, drift reports, reconciliation scripts — is derived from that single manifest. Nothing is hand-maintained downstream.

At Ledd Consulting, we run 26 health-checked services, 19 recurring automations, and 30+ managed processes on a single VPS. We govern all of it from one 900-line JSON file and five scripts that read it. When we add a service, we edit one file. When documentation drifts from reality, we regenerate it. When a process disappears, the control plane notices within 30 minutes.

This isn't Kubernetes. It's not Terraform. It's 600 lines of Node.js that solved the same problems those tools solve — for a fleet that doesn't need their complexity.

The Naive Approach (and Why It Fails)

Most teams managing a modest fleet of services on a VPS end up with the same entropy:

  • A README listing services that was accurate three months ago
  • A monitoring script that checks 18 of your 26 endpoints because someone forgot to add the last 8
  • Systemd units scattered across /etc/systemd/system/ with no inventory
  • PM2 processes that nobody remembers starting, running under the wrong user
  • Health check URLs hardcoded in three different scripts

The failure mode isn't dramatic. It's slow rot. You deploy a new service, forget to add it to the health checker, and three weeks later it's been down for five days with nobody noticing. You retire a process but it keeps running because nobody cleaned up PM2. Your documentation says you have 20 services; you actually have 26.

We lived this. At service number 15, we could still keep it in our heads. By service 22, we couldn't. The moment we caught a retired process still consuming memory after two weeks, we decided every piece of infrastructure metadata needed to live in one place.

Pattern Implementation

Layer 1: The Manifest

The entire system starts with a single JSON file. Here's the structure (trimmed for clarity):

{
  "references": {
    "runtimeTruth": "SYSTEM-STATE-LIVE.md",
    "systemState": "SYSTEM-STATE.md",
    "automationStateDb": "state/automation-state.sqlite"
  },
  "policy": {
    "browser": "sandboxed Chromium on loopback",
    "services": "least-privilege systemd by default",
    "auth": "loopback plus signed envelopes plus token auth",
    "exceptions": ["product-builder root service", "PM2 legacy fleet"]
  },
  "services": [
    {
      "service": "notification-router",
      "unit": "notification-router.service",
      "owner": "appuser",
      "port": 4000,
      "processModel": "systemd",
      "bindPolicy": "loopback",
      "authPolicy": "token auth",
      "ingress": "localhost only",
      "writablePaths": [
        "notifications/",
        "state/automation-state.sqlite"
      ]
    }
  ],
  "healthChecks": [
    {
      "port": 4000,
      "service": "notification-router",
      "purpose": "Smart notification batching",
      "healthPath": "/health"
    }
  ],
  "automations": [
    {
      "name": "system-state-live",
      "label": "System State Live Snapshot",
      "service": "system-state-live.service",
      "timer": "system-state-live.timer",
      "schedule": "Every 30 minutes",
      "owner": "appuser",
      "expectedIntervalMinutes": 30,
      "expectedMaxRuntimeMinutes": 5,
      "state": "SYSTEM-STATE-LIVE.{md,json}"
    }
  ]
}

Every service declares its port, owner, auth policy, bind policy, and writable paths. Every health check declares its port and endpoint. Every automation declares its schedule and expected runtime. This is the contract. Everything downstream reads it.
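Because everything downstream trusts the manifest, it pays to lint it before anything reads it. Here's a minimal sketch of such a check — `lintCatalog` is illustrative, not part of the actual control plane, though the field names mirror the manifest above:

```javascript
// Illustrative lint pass over the manifest shape shown above.
// Field names (service, owner, port) mirror the catalog; the
// function itself is a sketch, not the production validator.
function lintCatalog(catalog) {
  const errors = [];
  const seenPorts = new Set();
  for (const svc of catalog.services || []) {
    if (!svc.service) errors.push('service entry missing "service" name');
    if (!svc.owner) errors.push(`${svc.service}: missing owner`);
    if (seenPorts.has(svc.port)) errors.push(`${svc.service}: duplicate port ${svc.port}`);
    seenPorts.add(svc.port);
  }
  // Every health check should target a declared service
  const declared = new Set((catalog.services || []).map(s => s.service));
  for (const hc of catalog.healthChecks || []) {
    if (!declared.has(hc.service)) {
      errors.push(`health check for undeclared service ${hc.service}`);
    }
  }
  return errors;
}
```

Running a check like this in CI turns a malformed manifest into a failed build instead of a silent gap in monitoring.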

Layer 2: The Catalog Loader

A thin accessor layer normalizes and sorts catalog data. This is the interface every other script imports:

const fs = require('fs');
const path = require('path');

const CATALOG_FILE = path.join(__dirname, 'vps-catalog.json');

function loadCatalog() {
  return JSON.parse(fs.readFileSync(CATALOG_FILE, 'utf8'));
}

function getServices(catalog = loadCatalog()) {
  return [...(catalog.services || [])].sort((a, b) => (a.port || 0) - (b.port || 0));
}

function getHealthChecks(catalog = loadCatalog()) {
  return [...(catalog.healthChecks || [])].sort((a, b) => a.port - b.port);
}

function getAutomations(catalog = loadCatalog()) {
  return [...(catalog.automations || [])].sort((a, b) => a.name.localeCompare(b.name));
}

Sorting by port or name seems trivial. It matters more than you'd expect — when you're scanning a health report at 2 AM, predictable ordering means you find the failing service in two seconds instead of twenty.

The loader also includes a generic markdown table renderer that every downstream generator uses:

function renderMarkdownTable(headers, rows) {
  const headerRow = `| ${headers.join(' | ')} |`;
  const separatorRow = `| ${headers.map(() => '---').join(' | ')} |`;
  const bodyRows = rows.map(row => `| ${row.join(' | ')} |`);
  return [headerRow, separatorRow, ...bodyRows].join('\n');
}

Layer 3: Document Generation

One script reads the catalog and produces five markdown files:

const catalog = loadCatalog();

function buildVpsCatalog() {
  const healthChecks = getHealthChecks(catalog);
  const services = getServices(catalog);
  const pm2Drift = getPm2Drift();

  return `
# VPS Catalog

Generated from \`control-plane/vps-catalog.json\`.

## Health-Checked Services

${renderHealthCatalogTable(catalog)}

## Process Model Summary

- Service inventory entries: ${services.length}
- PM2 expected drift: missing ${pm2Drift.missingExpected.length}, stopped ${pm2Drift.stoppedExpected.length}

## Summary

- Health targets: ${healthChecks.length}
- Managed services: ${services.length}
- Automations: ${getAutomations(catalog).length}
`;
}

writeFile(vpsCatalogFile, buildVpsCatalog());
writeFile(configReposFile, buildConfigRepos());
writeFile(automationsFile, buildAutomations());
writeFile(processModelFile, buildProcessModel());
writeFile(pm2FleetFile, buildPm2Fleet());

Five files, all generated, all consistent, all traceable to a single source. We never edit these markdown files by hand. If they're wrong, the catalog is wrong — and we fix the catalog.

Layer 4: Health Aggregation

The live health checker imports the catalog and iterates over every declared target:

const { getHealthChecks, getServices, loadCatalog } = require('./control-plane/catalog');

const catalog = loadCatalog();
const HEALTH_TARGETS = getHealthChecks(catalog);

async function checkHealth(target) {
  try {
    const healthPath = target.healthPath || '/health';
    const res = await fetch(`http://127.0.0.1:${target.port}${healthPath}`, {
      signal: AbortSignal.timeout(4000)
    });
    const body = (await res.text()).slice(0, 200);
    return { ...target, ok: res.ok, status: res.status, body };
  } catch (err) {
    return { ...target, ok: false, error: err.message };
  }
}

This is the key payoff: when you add a health check entry to the catalog, it automatically gets checked every 30 minutes, appears in the generated docs, shows up in the live state dashboard, and triggers alerts on failure. No wiring. No second file to update.
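The sweep over all targets is a straightforward fan-out. A sketch of that aggregation loop, with the checker passed in as a parameter for clarity (the real checker also writes results into the live state files):

```javascript
// Run every declared health check concurrently and split out
// failures. `check` is the per-target checker (checkHealth above);
// the function shape here is a sketch, not the exact production code.
async function runSweep(targets, check) {
  const results = await Promise.all(targets.map(check));
  const failing = results.filter(r => !r.ok);
  return { total: results.length, failing };
}
```

Because the targets come straight from `getHealthChecks(catalog)`, the sweep can never check fewer endpoints than the manifest declares.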

The health checker also validates policy compliance — verifying that services run under the declared user, bind to the correct interface, and maintain expected auth configuration:

function inspectService(unit, meta = {}) {
  const raw = run(`systemctl show ${unit} --no-pager -p User -p Group -p ActiveState`);
  const parsed = parseKeyValueLines(raw);
  const expectedUser = meta.owner || '';
  const actualUser = parsed.User || null;
  const drift = [];
  if (expectedUser && actualUser && expectedUser !== actualUser)
    drift.push(`user ${actualUser}!=${expectedUser}`);
  return {
    service: meta.service || unit,
    expectedUser,
    actualUser,
    healthy: parsed.ActiveState === 'active',
    drift,
  };
}
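The snippet above assumes two small helpers: `run`, which shells out and returns stdout, and `parseKeyValueLines`, which turns `systemctl show` output into an object. The latter can be as simple as this sketch:

```javascript
// Sketch of the parseKeyValueLines helper assumed above: turns
// `systemctl show` output ("User=appuser\nActiveState=active")
// into a plain object keyed by property name.
function parseKeyValueLines(raw) {
  const out = {};
  for (const line of raw.split('\n')) {
    const idx = line.indexOf('=');
    if (idx > 0) out[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return out;
}
```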

Layer 5: Reconciliation

The reconciler is the enforcement arm. It reads the catalog, generates docs, syncs files to the live workspace, and restarts services that have changed:

function copyIfChanged(src, dst) {
  const srcBuf = fs.readFileSync(src);
  const dstExists = fs.existsSync(dst);
  if (dstExists) {
    const dstBuf = fs.readFileSync(dst);
    if (Buffer.compare(srcBuf, dstBuf) === 0) return false;
  }
  fs.copyFileSync(src, dst);
  return true;
}

function main() {
  // Step 1: Generate fresh docs from catalog
  execFileSync('node', [generatorScript], { stdio: 'inherit' });

  // Step 2: Sync workspace files (only if changed)
  const changes = [...syncWorkspace(), ...syncRepoAssets()];

  // Step 3: Reload systemd so any changed unit files take effect
  execFileSync('systemctl', ['daemon-reload'], { stdio: 'inherit' });

  // Step 4: Reconcile PM2 fleet if relevant files changed
  if (shouldReconcilePm2(changes)) {
    execFileSync('node', [reconcilePm2Script], { stdio: 'inherit' });
  }

  // Step 5: Refresh automation summary and restart affected services
  execFileSync('systemctl', ['restart',
    'notification-router.service', 'mission-control.service'],
    { stdio: 'inherit' });
}

The copyIfChanged function is critical. Binary comparison means we only restart services when files actually changed — not on every reconciliation run. This keeps the system stable during routine syncs.

Layer 6: Drift Detection

For the PM2 fleet, drift detection compares the catalog's expected state against live reality:

function main() {
  const before = getLivePm2Inventory(catalog);
  const plan = buildManagedPlan(catalog, before);
  const artifacts = writePm2RuntimeArtifacts(catalog, plan);

  // Apply expected state
  tryPm2(['startOrReload', artifacts.internalFile, '--update-env']);
  tryPm2(['startOrReload', artifacts.externalFile, '--update-env']);

  // Clean up drift: retired and unexpected processes
  const drift = buildPm2Drift(catalog, getLivePm2Inventory(catalog));
  for (const name of [...drift.retiredLive, ...drift.unexpected]) {
    tryPm2(['delete', name]);
  }

  // Report
  const after = getLivePm2Inventory(catalog);
  const finalDrift = buildPm2Drift(catalog, after);
  console.log(`PM2 online: ${after.filter(i => i.status === 'online').length}/${after.length}`);
  console.log(`PM2 expected missing: ${finalDrift.missingExpected.length}`);
  console.log(`PM2 retired live: ${finalDrift.retiredLive.length}`);
  console.log(`PM2 unexpected: ${finalDrift.unexpected.length}`);
}

Four drift categories: processes that should exist but don't (missingExpected), processes that should be running but aren't (stoppedExpected), processes we retired but are still alive (retiredLive), and processes we never declared (unexpected). Each category tells you a different kind of problem.
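The four-way split is a handful of set comparisons. A sketch of how `buildPm2Drift` could compute it — the `pm2` catalog section and its `{ name, retired }` entries are illustrative field names, not the exact manifest schema:

```javascript
// Sketch of the four-way drift computation. Assumes each catalog
// PM2 entry looks like { name, retired } and each live entry like
// { name, status } — illustrative shapes.
function buildPm2Drift(catalog, live) {
  const liveByName = new Map(live.map(p => [p.name, p]));
  const entries = catalog.pm2 || [];
  const expected = entries.filter(p => !p.retired);
  const retired = entries.filter(p => p.retired);
  const declared = new Set(entries.map(p => p.name));
  return {
    // Declared, not retired, but absent from the live fleet
    missingExpected: expected.filter(p => !liveByName.has(p.name)).map(p => p.name),
    // Present in the fleet but not online
    stoppedExpected: expected
      .filter(p => liveByName.has(p.name) && liveByName.get(p.name).status !== 'online')
      .map(p => p.name),
    // Retired in the catalog yet still alive
    retiredLive: retired.filter(p => liveByName.has(p.name)).map(p => p.name),
    // Running but never declared at all
    unexpected: live.filter(p => !declared.has(p.name)).map(p => p.name),
  };
}
```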

In Production

We've run this control plane since early 2026. Real numbers:

  • 26 health-checked endpoints, all derived from the catalog. Zero forgotten.
  • 12 systemd services and 30+ PM2 processes governed by one manifest.
  • 19 recurring automations with declared schedules and expected runtimes.
  • 5 generated markdown files that stay in sync automatically.
  • 30-minute reconciliation cycle catches drift before it matters.

The pattern caught its first real problem within days: a process we'd retired from the catalog was still running under PM2, consuming ~180MB of memory. The drift detector flagged it as retiredLive, and the reconciler cleaned it up automatically.

Policy checks have caught three separate cases of services running under the wrong user after system updates — something we would never have noticed manually until a permission error hit production.

The biggest operational win is onboarding context. Any developer (or AI agent) can read the generated VPS-CATALOG.md and understand the entire system in 60 seconds: what services exist, what ports they're on, how they authenticate, and whether they're healthy. Before the catalog, that knowledge lived in one person's head.

Edge Cases We Hit

Stale PM2 snapshots. The health checker reads a cached PM2 snapshot instead of running pm2 jlist every time (which is slow). We write the snapshot after every reconciliation and read it during health checks. The tradeoff: drift data can be up to 30 minutes stale. For our purposes, that's acceptable.

Circular restarts. The reconciler restarts mission-control and notification-router after syncing. If those services depend on files that just changed, the restart order matters. We restart them last, after all files are synced.

Generator idempotency. We run the doc generator twice during reconciliation — once before syncing (so the docs are fresh) and once after (so the live workspace has the latest). copyIfChanged ensures only actual differences trigger downstream actions.

Variations

For Kubernetes teams: Replace the JSON manifest with a CRD. Replace the reconciler with a controller. The pattern is identical — one declaration of expected state, one loop that enforces it. The difference is our loop runs as a Node.js script instead of a K8s operator.

For smaller fleets (5-10 services): Skip the PM2 drift detection and process model tracking. Keep the catalog, health checks, and doc generation. Even at 5 services, auto-generated docs that can't drift from reality are worth the 200 lines of code.

For multi-host deployments: The catalog gains a host field per service. The health checker runs remotely instead of against 127.0.0.1. The reconciler becomes a push-based deployment tool. The manifest shape doesn't change much — the execution layer does.

For teams using Terraform/Pulumi: This pattern complements IaC rather than replacing it. Terraform manages cloud resources; the catalog manages application-layer services. They can even share a source — generate the catalog from Terraform outputs.

Conclusion

The catalog-driven control plane is a boring pattern. One JSON file. A loader. A generator. A health checker. A reconciler. No message queues, no service mesh, no distributed consensus.

That's the point. Infrastructure governance doesn't need to be sophisticated. It needs to be complete and consistent. One manifest that declares everything, one loop that checks everything, one generator that documents everything. When those three properties hold, you stop losing services to entropy and start spending your time on problems that actually matter.

If you find yourself maintaining more than ten services and updating health checks by hand, the catalog pattern pays for itself in the first week. The total implementation is under 600 lines of code, and the hardest part is the discipline of editing the manifest every time you add a service.

Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.

By Ledd Consulting