When Two Users on One VPS Corrupted Our Shared State — and How We Hardened 25 Services in a Weekend

Summary

On March 25, 2026, we added a second system user to our single-VPS infrastructure running 25 AI microservices at Ledd Consulting. Within hours, the event bus started silently dropping events, the automation state database threw SQLITE_BUSY errors under load, and two JSON state files were truncated to zero bytes. Total impact: ~40 events lost, three automation runs recorded as phantoms (started but never finished), and a notification digest that shipped with stale data. We traced the root cause to three distinct concurrency failures — a read-modify-write race in our event logger, missing file permissions for cross-user access, and SQLite's default journal mode choking on concurrent connections. Full resolution took from March 25 through April 2, with the critical fixes landing within 48 hours and hardening measures verified by the end of the week.

Timeline

March 25, 10:15 AM — We onboard a second system user (hermes) to run a supervisory agent alongside the primary platform user. Both users need read-write access to the shared automation state database and several JSON state files.

March 25, 11:42 AM — First alert: the drip sequence engine fails with Error: attempt to write a readonly database. The automation state SQLite file was created by platform with default 0o644 permissions — the new user can read but not write.

March 25, 12:10 PM — We chmod the database file to 0o664 and add both users to an agent group. Services restart. The immediate error stops.

March 25, 2:30 PM — The morning briefing compiler reports a gap in event history. Investigation reveals events-2026-03-25.json has only 14 entries when the event bus processed 53 events that day. We initially suspect a logging bug.

March 25, 5:15 PM — A second anomaly: automation-jobs.json is discovered at zero bytes. Mission control dashboard shows no data. The file recovers on the next automation run, but the gap confirms something is clobbering shared files.

March 26, 9:00 AM — Root cause identified: three separate failure modes. The event bus logEvent() function uses a read-modify-write pattern on a JSON array, so two concurrent events can overwrite each other. The SQLite database runs in rollback journal mode by default, failing immediately with SQLITE_BUSY under concurrent writes. And because new files are masked by the creating process's umask, they don't get group-write permissions.

March 26–28 — We implement fixes in phases: SQLite WAL mode with busy timeout, atomic temp-file-rename for JSON writes, and explicit permission enforcement on every file operation.

April 2 — Final hardening verified across all shared state surfaces. Notification router, automation runner, and ledger systems all confirmed operational under concurrent access.

Root Cause

Three independent bugs conspired. Any one alone would have been a nuisance. Together, they caused silent data loss.

Bug 1: The Read-Modify-Write Race

Our event bus logged every processed event to a daily JSON file. The pattern looked like this:

function logEvent(event, results) {
  const logFile = getLogFilePath();
  const entry = {
    id: event.id,
    type: event.type,
    source: event.source,
    timestamp: event.timestamp,
    receivedAt: event._receivedAt,
    processedAt: new Date().toISOString(),
    data: event.data,
    results: results.map(r => ({
      action: r.actionName,
      success: r.success,
      statusCode: r.statusCode || null,
      error: r.error || null,
      retries: r.retries || 0,
      durationMs: r.durationMs || 0,
    }))
  };

  try {
    let events = [];
    if (fs.existsSync(logFile)) {
      try {
        events = JSON.parse(fs.readFileSync(logFile, 'utf8'));
      } catch (e) {
        events = [];
      }
    }
    events.push(entry);
    fs.writeFileSync(logFile, JSON.stringify(events, null, 2));
  } catch (err) {
    logError('Failed to write event log', err);
  }
}

This is the textbook race condition. Node.js is single-threaded, yes — but we run multiple processes. The event bus handles HTTP requests, and when two events arrive in quick succession on overlapping requests (or when a timer-triggered service fires at the same moment as an inbound webhook), Process A reads the file, Process B reads the same file, both append their entry, and whichever writes last wins. The other event vanishes.

The dead letter queue had the same pattern:

function addToDeadLetters(event, actionName, error, retries) {
  const entry = {
    id: crypto.randomUUID(),
    eventId: event.id,
    eventType: event.type,
    // ...
  };

  try {
    let deadLetters = [];
    if (fs.existsSync(DEAD_LETTERS_FILE)) {
      try {
        deadLetters = JSON.parse(fs.readFileSync(DEAD_LETTERS_FILE, 'utf8'));
      } catch (e) {
        deadLetters = [];
      }
    }
    deadLetters.push(entry);
    fs.writeFileSync(DEAD_LETTERS_FILE, JSON.stringify(deadLetters, null, 2));
  } catch (err) {
    logError('Failed to write dead letter', err);
  }
}

Same read-modify-write. Same data loss risk. Two places where evidence of system failures could silently disappear.

Bug 2: SQLite Default Journal Mode

Our automation state database — tracking every automation run across 60+ scheduled timers — was opened with no special configuration. SQLite's default rollback journal mode allows exactly one writer at a time, and the default busy timeout is zero. If Process A holds a write lock and Process B tries to write, Process B gets an immediate SQLITE_BUSY error. No retry. Just failure.

With 25 services sharing one database, this was a ticking bomb that detonated the moment we added a second user running concurrent workloads.

Bug 3: Permission Inheritance

On Linux, a new file's permissions are determined by the creating process's umask, not by the parent directory's permissions. Our processes ran with the default umask of 022, which means any file created by platform got 0o644 permissions: owner read-write, group read-only. The hermes user, despite being in the same group, couldn't write.

Worse: SQLite in WAL mode creates two companion files (-shm and -wal). If the main database file has correct permissions but these companion files don't, you get the same readonly database error. And these files are recreated whenever the database is reopened, inheriting the creator's umask all over again.

The Fix

We addressed each failure mode with a distinct pattern, then consolidated them into shared utility libraries.

Fix 1: Atomic Writes for JSON State

We replaced every read-modify-write JSON pattern with a temp-file-plus-atomic-rename strategy:

function writeJson(filePath, data) {
  ensureSharedDir(path.dirname(filePath));
  const tempFile = `${filePath}.${process.pid}.${Date.now()}.tmp`;
  fs.writeFileSync(tempFile, `${JSON.stringify(data, null, 2)}\n`, { mode: 0o664 });
  fs.renameSync(tempFile, filePath);
}

The key insight: fs.renameSync is atomic on POSIX systems when source and destination are on the same filesystem. There's no window where the target file is half-written. Either you see the old data or the new data — never a truncated file, never zero bytes.

For append-only files like verification ledgers, we use explicit file descriptor control instead:

function appendJsonLine(filePath, entry) {
  ensureDirs();
  const fd = fs.openSync(filePath, 'a', 0o664);
  try {
    try {
      fs.fchmodSync(fd, 0o664);
    } catch (error) {
      if (!error || !['EPERM', 'EACCES'].includes(error.code)) {
        throw error;
      }
    }
    fs.writeSync(fd, `${JSON.stringify(entry)}\n`, null, 'utf8');
  } finally {
    fs.closeSync(fd);
  }
  ensureSharedPath(filePath, 0o664);
}

Opening with 'a' (append mode) means the OS kernel handles the seek-to-end atomically. The fchmodSync on the open file descriptor fixes permissions regardless of the creator's umask. The finally block guarantees the descriptor is released even on error.

Fix 2: SQLite with WAL and Busy Timeout

We built a shared database opener that every service now uses:

const SHARED_DIR_MODE = 0o2775;
const SHARED_FILE_MODE = 0o664;

function configureSqliteDb(db) {
  db.exec('PRAGMA journal_mode = WAL');
  db.exec('PRAGMA synchronous = NORMAL');
  db.exec('PRAGMA busy_timeout = 5000');
  return db;
}

function openSharedSqliteDb(dbFile) {
  ensureSharedSqliteArtifacts(dbFile);
  const db = configureSqliteDb(new DatabaseSync(dbFile));
  ensureSharedSqliteArtifacts(dbFile);
  return db;
}

Three PRAGMAs, each doing specific work:

  • journal_mode = WAL: Write-Ahead Logging allows multiple concurrent readers alongside a single writer, instead of writes blocking all reads (and reads blocking writes) as in rollback mode.
  • synchronous = NORMAL: Balances durability with performance. We're not running a bank — we can tolerate losing the last transaction on a power failure in exchange for 10x fewer fsync calls.
  • busy_timeout = 5000: Instead of failing immediately on SQLITE_BUSY, SQLite retries for up to 5 seconds. This eliminates virtually all contention errors in our workload.

Notice ensureSharedSqliteArtifacts is called twice — before and after opening the database. The database open itself can create new -shm and -wal files, and those need their permissions fixed immediately:

function ensureSharedSqliteArtifacts(dbFile) {
  ensureSharedDir(path.dirname(dbFile));
  for (const target of [dbFile, `${dbFile}-shm`, `${dbFile}-wal`]) {
    ensurePathMode(target, SHARED_FILE_MODE);
  }
}

Fix 3: Permission Enforcement at Every Layer

We couldn't trust the umask. Instead, we enforce permissions explicitly at every file operation:

function ensurePathMode(target, mode) {
  try {
    if (!fs.existsSync(target)) return;
    const currentMode = fs.statSync(target).mode & 0o7777;
    if (currentMode !== mode) {
      fs.chmodSync(target, mode);
    }
  } catch (error) {
    if (!error || !['EPERM', 'EACCES'].includes(error.code)) {
      throw error;
    }
  }
}

Directories get 0o2775 — the setgid bit (2) means new files inherit the directory's group ownership instead of the creator's primary group. Files get 0o664 — read-write for owner and group. The EPERM/EACCES catch is deliberate: if a process can't fix permissions (because it's not the owner), it swallows the error rather than crashing. The file is probably already correct if another process created it with the right settings.

Fix 4: Transactional Operations

For complex state mutations, we wrapped SQLite operations in explicit transactions:

replaceObject(namespace, object) {
  const entries = Object.entries(object || {});
  const remove = this.db.prepare(
    `DELETE FROM ${this.tableName} WHERE namespace = ?`
  );
  const insert = this.db.prepare(`
    INSERT INTO ${this.tableName} (namespace, item_key, value_json, updated_at)
    VALUES (?, ?, ?, CURRENT_TIMESTAMP)
  `);

  this.db.exec('BEGIN');
  try {
    remove.run(namespace);
    for (const [key, value] of entries) {
      insert.run(namespace, key, JSON.stringify(value));
    }
    this.db.exec('COMMIT');
  } catch (err) {
    try {
      this.db.exec('ROLLBACK');
    } catch (_) {}
    throw err;
  }
}

The delete-then-reinsert pattern without a transaction would leave a window where the namespace is empty. With BEGIN/COMMIT, other readers see either the old state or the new state — never an empty table.

Prevention

We implemented four systemic changes to prevent recurrence:

1. Shared utility libraries. Every service that touches shared state must use openSharedSqliteDb() for databases and writeJson() or writeJsonMirror() for JSON files. No more inline fs.writeFileSync to shared paths.

2. Dual-storage pattern. Critical state now lives in SQLite (for atomicity) with a JSON mirror (for debuggability). The json-file-mirror.js module writes to both:

function writeJsonMirror(filePath, value) {
  ensureDir(path.dirname(filePath));
  const tempFile = `${filePath}.${process.pid}.${Date.now()}.tmp`;
  const serialized = `${JSON.stringify(value, null, 2)}\n`;
  fs.writeFileSync(tempFile, serialized, { mode: 0o664 });
  fs.renameSync(tempFile, filePath);
}

SQLite is the source of truth. The JSON file is a human-readable shadow that dashboards and debugging tools can read without a database driver.

3. Backup-restore drills. Our weekly backup drill now verifies the SQLite database can be restored and queried — checking row counts in automation_runs and automation_kv tables. If a backup contains a corrupted database, we know before we need it.

4. Operational rule for manual repairs. Any time someone manually edits a shared state file, the first thing they must do afterward is restore shared ownership and permissions. This is documented in our system state manifest and enforced by convention.

Lessons for Your Team

The "single-threaded" safety net has holes. Node.js developers often assume concurrency bugs can't happen because JavaScript is single-threaded. But the moment you have multiple processes — systemd services, cron jobs, worker processes — you have classic concurrency problems. Treat every shared file as a critical section.

Atomic rename is your best friend for JSON state. The temp-file-plus-rename pattern is simple, requires no external dependencies, works on every POSIX system, and eliminates an entire class of corruption bugs. There is almost no reason to writeFileSync directly to a shared state file.

SQLite needs three PRAGMAs for multi-process use. The defaults (journal_mode=DELETE, synchronous=FULL, busy_timeout=0) are optimized for single-process embedded use. If multiple processes share a SQLite database, configure WAL mode and a busy timeout before any writes. This is a five-line fix that prevents hours of debugging.

Don't trust the umask. If two users or two service accounts need to share files, explicit chmod after every create operation is the only reliable approach. The setgid bit on directories helps with group ownership, but file mode still depends on the creator's umask.

Silent data loss is worse than a crash. Our event bus didn't error — it just quietly dropped events when two writes raced. The dead letter queue — the very mechanism designed to catch failures — had the same bug. Prefer patterns that fail loudly (SQLite transactions, atomic renames that throw on failure) over patterns that silently produce wrong results.

Conclusion

This incident taught us that the gap between "works with one user" and "works with two users" is wider than it looks. Every implicit assumption about file ownership, write ordering, and lock behavior became an explicit bug the moment we added a second process owner. The fixes weren't exotic — WAL mode, atomic renames, explicit permissions — but finding all the places they were needed across 25 services required methodical auditing.

The good news: once we built the shared utility layer (json-sqlite-store.js, writeJsonMirror, appendJsonLine), hardening a service was a one-line change — swap fs.writeFileSync for the safe equivalent. The hardest part wasn't writing the fix. It was finding every place the old pattern lived.

Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.


By Ledd Consulting