When Our Notification System Said "Sent" But Nobody Got the Email

Summary

On March 28, 2026, we discovered that our notification router and event bus had been silently dropping operator emails for approximately 36 hours. The root cause: we trusted the exit code of our mail CLI tool (himalaya) as proof of delivery. The tool returned 0 (success) even when the SMTP handshake completed but Gmail rejected the message at the transport layer. No alerts fired. No errors logged. Operator-critical digests — including overnight activity summaries and verification confirmations — vanished into the void. Resolution took 14 hours from detection to deployed fix, and we now verify every outbound email by confirming its existence in the [Gmail]/Sent Mail folder before marking it delivered.

Timeline

March 27, 08:00 UTC — Routine. Morning briefing email arrives on schedule. All 25 services report healthy. Nothing unusual in logs.

March 27, ~14:00 UTC — Unknown at the time: a Gmail OAuth token refresh returns a degraded session. The himalaya CLI continues to accept message send commands and returns exit code 0, but messages are silently rejected after the SMTP DATA command.

March 28, 08:15 UTC — Morning briefing doesn't arrive. We initially assume it's a timer issue — we run 60+ scheduled timers on a single VPS, and occasional scheduling drift happens.

March 28, 08:40 UTC — Manual check of the notification router logs shows the briefing was "sent successfully" at 07:00. Logs look clean. No errors, no warnings, no retries.

March 28, 09:00 UTC — We check Gmail's Sent folder directly. The briefing isn't there. Neither are 22 other operator notifications from the past 18 hours.

March 28, 09:15 UTC — Scope of impact confirmed: every email sent through email-delivery.js since ~14:00 the previous day shows as delivered in our system but doesn't exist in Gmail. 23 missed notifications including event-bus alerts, digest summaries, and two lead-scoring notifications.

March 28, 10:00 UTC — Root cause identified: himalaya message send returns 0 whenever it successfully hands the message to the SMTP transport, regardless of whether the remote server accepted it for delivery.

March 28, 14:00–22:00 UTC — We build and deploy the verification-ledger pattern: stamp every outbound email with a unique delivery ID, then confirm it actually landed in Sent Mail before treating it as delivered.

March 28, 23:00 UTC — Verification system live in production. Backlog of 23 failed notifications re-sent and confirmed.

Root Cause

The failure was a classic trust-boundary violation: we treated a local operation's success code as proof of a remote operation's completion.

Here's what our email delivery looked like before the incident:

// email-delivery.js — BEFORE (the version that lied to us)
const fs = require("fs");
const { execSync } = require("child_process");

async function sendOperatorEmail({ to, subject, body }) {
  const tmpFile = `/tmp/email-${Date.now()}.txt`;
  fs.writeFileSync(tmpFile, body);

  try {
    execSync(
      `himalaya message send --to "${to}" --subject "${subject}" --body-file "${tmpFile}"`,
      { timeout: 30000 }
    );
    logger.info(`Email sent: "${subject}" -> ${to}`);
    return { success: true, sentAt: new Date().toISOString() };
  } catch (err) {
    logger.error(`Email failed: ${err.message}`);
    return { success: false, error: err.message };
  } finally {
    fs.unlinkSync(tmpFile);
  }
}

The problem is subtle. execSync throws only when the process exits with a non-zero code. The himalaya CLI exits 0 after it completes the SMTP transaction — meaning it handed the bytes to the server. But "the server accepted the TCP payload" is not the same as "the email was delivered." When the OAuth session degraded, Gmail's SMTP endpoint accepted the connection, received the DATA payload, and then quietly discarded it. No bounce. No error. Exit code 0.

Every downstream consumer — our notification router, event bus, morning briefing, lead-scoring alerts — called sendOperatorEmail(), got { success: true }, logged "sent successfully," and moved on.

We had 25 services trusting this single function. Not one of them had any way to know it was lying.

The Fix

The fix has two parts: delivery stamping and sent-folder verification.

Part 1: Stamp Every Outbound Email

Every email now gets a unique X-Ledd-Delivery-Id header injected before sending:

// email-delivery.js — AFTER
const crypto = require("crypto");
const fs = require("fs");
const { execSync } = require("child_process");

function generateDeliveryId() {
  return `ledd-${Date.now()}-${crypto.randomBytes(4).toString("hex")}`;
}

async function sendOperatorEmail({ to, subject, body }) {
  const deliveryId = generateDeliveryId();
  const headers = `X-Ledd-Delivery-Id: ${deliveryId}`;
  const tmpFile = `/tmp/email-${deliveryId}.txt`;

  // Prepend custom header to the raw message
  const rawMessage = `${headers}\n\n${body}`;
  fs.writeFileSync(tmpFile, rawMessage);

  try {
    execSync(
      `himalaya message send --to "${to}" --subject "${subject}" --body-file "${tmpFile}"`,
      { timeout: 30000 }
    );

    // CLI said "sent" — but we don't believe it anymore
    const verified = await verifySentMail(deliveryId, subject);

    if (verified) {
      appendToLedger("verification-ledgers", {
        deliveryId,
        subject,
        to,
        status: "verified",
        verifiedAt: new Date().toISOString(),
      });
      logger.info(`Email verified in Sent Mail: ${deliveryId}`);
      return { success: true, deliveryId, verified: true };
    } else {
      // The CLI lied. Retract the claim.
      appendToLedger("claim-ledgers", {
        deliveryId,
        subject,
        to,
        status: "retracted",
        reason: "not_found_in_sent_mail",
        retractedAt: new Date().toISOString(),
      });
      emitEvent("claim.retracted", { deliveryId, subject, to });
      emitEvent("verification.failed", { deliveryId, subject, to });
      logger.warn(`Email NOT verified — retracted: ${deliveryId}`);
      return { success: false, deliveryId, verified: false, retracted: true };
    }
  } catch (err) {
    logger.error(`Email send error: ${err.message}`);
    return { success: false, error: err.message };
  } finally {
    fs.unlinkSync(tmpFile);
  }
}

Part 2: Verify Against Gmail's Sent Folder

The verification function searches the Sent Mail folder for the delivery ID. This is the actual proof — if Gmail has it in Sent, it was delivered. If not, it wasn't.

// Promise-based delay helper used by verifySentMail
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function verifySentMail(deliveryId, subject, retries = 3) {
  // Give Gmail a moment to index the sent message
  await sleep(2000);

  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      // List recent envelopes in Sent Mail
      const envelopes = execSync(
        `himalaya envelope list "[Gmail]/Sent Mail" --page-size 20`,
        { timeout: 15000 }
      ).toString();

      // parseEnvelopes (defined elsewhere) turns himalaya's listing into
      // objects with { id, subject }; match on our subject prefix
      const candidates = parseEnvelopes(envelopes).filter((e) =>
        e.subject.includes(subject.slice(0, 40))
      );

      // Export full message and check for our delivery ID header
      for (const candidate of candidates) {
        const fullMessage = execSync(
          `himalaya message export --full "${candidate.id}"`,
          { timeout: 10000 }
        ).toString();

        if (fullMessage.includes(deliveryId)) {
          return true;
        }
      }

      if (attempt < retries) {
        await sleep(3000 * attempt); // Back off between retries
      }
    } catch (err) {
      logger.warn(
        `Verification attempt ${attempt} failed: ${err.message}`
      );
    }
  }

  return false;
}

Part 3: The Ledger Trail

Every delivery outcome — verified or retracted — gets appended to a JSONL ledger file. This gives us an audit trail we can query after the fact:

const fs = require("fs");
const path = require("path");
// MEMORY_DIR: base directory for ledger storage, set in app config

function appendToLedger(ledgerType, record) {
  const dir = path.join(MEMORY_DIR, ledgerType);
  fs.mkdirSync(dir, { recursive: true });

  const today = new Date().toISOString().split("T")[0];
  const file = path.join(dir, `${today}.jsonl`);

  fs.appendFileSync(file, JSON.stringify(record) + "\n");
}

A sample verification ledger entry:

{"deliveryId":"ledd-1711648200000-a3f2b1c9","subject":"Morning Briefing — Mar 29","to":"operator@example.com","status":"verified","verifiedAt":"2026-03-29T07:00:14.221Z"}

A sample retraction:

{"deliveryId":"ledd-1711648800000-d4e1c0a8","subject":"Lead Score Alert","to":"operator@example.com","status":"retracted","reason":"not_found_in_sent_mail","retractedAt":"2026-03-29T07:15:02.887Z"}

Prevention

We made four systemic changes beyond the immediate fix:

1. Operational rule: no downstream completion without verification. Our notification router and event bus both consumed sendOperatorEmail(). Now neither marks a notification digest or event-bus email action as "complete" unless Gmail-side verification succeeds. If verification fails, the action stays in a pending_retry state and gets re-attempted on the next cycle.

2. Verification events flow through the event bus. Both verification.failed and claim.retracted events are emitted through our broker-less event bus (the same one that connects our 25 services). This means any service can subscribe to delivery failures. Today, our mission control dashboard surfaces them in real time.

3. Claim retraction as a first-class concept. This was the harder design decision. Before this incident, our system had a binary model: an action either succeeded or failed. Now we have a third state — retracted — meaning the system initially reported success but later determined that claim was false. The retraction record links back to the original delivery ID, creating a full evidence chain.

4. Daily ledger reconciliation. A nightly job compares the verification ledger against expected sends (derived from timer schedules and event-bus triggers). Any expected email that lacks a verification record triggers an alert. This catches the scenario where the entire send-and-verify pipeline silently fails to run.

Lessons for Your Team

Exit code zero means nothing about remote systems. This applies far beyond email. If you're calling any external CLI tool — a deploy script, a database migration runner, an API client — and treating its exit code as proof that a remote operation completed, you have the same bug we did. The fix is always the same: verify the outcome independently of the tool that performed the action.

Separate "attempted" from "confirmed." Most notification systems log "email sent" at the moment they hand the message off. That log line creates a false sense of reliability. If you can't verify delivery through a second channel (Sent folder, webhook receipt, read confirmation), at minimum label it "attempted" rather than "sent" so your on-call team doesn't waste 40 minutes assuming the email system is working.

Build retraction into your data model early. We had to retrofit the concept of "we said this happened but it actually didn't." If your system makes claims about real-world actions — "email sent," "webhook delivered," "payment processed" — design for the possibility that those claims are wrong from day one. A retraction ledger is cheap to maintain and invaluable during incident response.
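A minimal sketch of that three-state claim model (the names here are illustrative, not from our codebase):

```javascript
// Three-state claim model: a success claim can later be retracted.
const ClaimStatus = Object.freeze({
  ATTEMPTED: "attempted", // local operation ran; remote outcome unknown
  VERIFIED: "verified",   // remote outcome independently confirmed
  RETRACTED: "retracted", // earlier success claim proven false
});

// A retraction always points back at the claim it invalidates,
// preserving the evidence chain via the shared delivery ID.
function retract(claim, reason) {
  return {
    deliveryId: claim.deliveryId,
    status: ClaimStatus.RETRACTED,
    reason,
    retractedAt: new Date().toISOString(),
  };
}
```

Because the retraction carries the original delivery ID, an incident responder can join the claim and retraction records and see exactly when the system's story changed.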

JSONL ledgers beat database tables for audit trails. We chose append-only JSONL files over database writes for two reasons: they survive database outages (which is exactly when you most need your audit trail), and they're trivially greppable during an incident. One file per day, one line per event, grep gets you answers in seconds.

Silent failures are worse than loud ones. The scariest thing about this incident wasn't the 23 dropped emails. It was the 36 hours where our system confidently reported everything was fine. A hard crash at 14:00 on March 27 would have been fixed in 20 minutes. A silent success that wasn't actually successful took 18 hours to even notice. Invest your reliability budget in detecting silent failures — they're the ones that actually hurt.

Conclusion

This incident cost us 36 hours of notification blindness and 23 dropped operator emails. The technical fix was straightforward — verify outbound emails against the Sent folder before claiming delivery. But the architectural lesson was deeper: any system that reports success based solely on a local operation's exit code is lying to you about the state of the world. Our verification-ledger pattern adds about 4 seconds of latency per email send. That's a trade we'd make every time.

The verification and claim-retraction patterns we built here have since propagated to other parts of our system. Anywhere we make a claim about a real-world side effect — "webhook delivered," "file uploaded," "report generated" — we now verify independently and maintain a retraction trail. The audit surface this creates has already caught two more silent failures in other services, both before they impacted operations.

Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.


By Ledd Consulting