When Our Drip Sequence Engine Sent 3 Weeks of Emails in One Morning

At Ledd Consulting we run 25 services and 60+ scheduled timers on a single VPS. Most of the time, that architecture is a strength — low cost, low latency, easy to reason about. But on the morning of February 18th, it turned into a liability. A routine VPS restart caused our drip sequence engine to evaluate every active lead's timeline from scratch, decide that multiple steps were "due," and draft follow-up emails for 3 weeks' worth of sequences in a single 4-minute run. Fourteen leads had 2–4 emails each queued for them, three of them prospects we'd been carefully nurturing over 10-day arcs. We detected the blast within 11 minutes, but the damage to cadence was done. This is the story of what broke, why, and the three-layer fix we shipped the same afternoon.

Timeline

06:01 UTC — VPS reboots after a kernel security patch. All systemd timers restart cleanly. Services come up in dependency order.

06:03 UTC — drip-sequences.timer fires its daily run. The service starts normally, loading CRM data and sequence state from SQLite.

06:04 UTC — The engine processes 31 active leads across three sequences. Instead of finding 4–5 leads with a single step due, it drafts 38 emails — every pending step for every lead whose enrollment date now puts them past multiple day thresholds.

06:07 UTC — client-nurture.timer fires. It finds 6 "cold" leads eligible for follow-up (past the COLD_LEAD_DAYS threshold of 3 days) and drafts another batch, some overlapping with drip-sequence leads.

06:14 UTC — Our notification router delivers the morning digest. The digest shows 38 drip drafts and 6 nurture drafts queued — against a normal baseline of 3–5 total. The anomaly is obvious.

06:25 UTC — Manual intervention. We kill the pending queue. Fourteen emails from the drip engine have already been queued to the drafts folder with status: pending_approval, and 3 of those have been picked up by our send pipeline.

06:40 UTC — Root cause identified. Fix designed.

09:30 UTC — Three-layer fix deployed. All timers restarted. Verified with --preview dry run.

Root Cause

The drip engine's core scheduling logic calculates which step a lead is "due for" by comparing the lead's enrollment date against day offsets defined in each sequence. Here's the sequence definition for our website lead funnel:

const SEQUENCES = {
  website_lead: {
    name: 'New Website Lead',
    trigger: (lead) =>
      lead.platform === 'inbound' ||
      (lead.notes && lead.notes.toLowerCase().includes('website')),
    steps: [
      { day: 0, subject: 'Thanks for reaching out — {company}', prompt_context: 'welcome_acknowledgment' },
      { day: 2, subject: 'Quick case study you might find relevant', prompt_context: 'case_study' },
      { day: 5, subject: 'Free resource for {company}', prompt_context: 'free_resource' },
      { day: 10, subject: 'Quick check-in — {company}', prompt_context: 'last_touch' },
    ],
  },

  consulting_outbound: {
    name: 'Consulting Outreach Follow-up',
    trigger: (lead) => isConsultingOutreachLead(lead),
    steps: [
      { day: 3, subject: 'Following up — {company}', prompt_context: 'consulting_followup' },
      { day: 7, subject: 'Thought you might find this useful', prompt_context: 'consulting_resource' },
      { day: 14, subject: 'Close the loop? — {company}', prompt_context: 'consulting_close_loop' },
    ],
  },
};

The engine's state tracked which step index each lead had completed. On a normal day, the logic was straightforward: "Lead X is on step 1. Step 2 is due on day 5. It's been 5 days since enrollment. Draft step 2." The problem was in what happened when the engine hadn't run for a day — or, in our case, when the state loaded cleanly but the "last run" timestamp was absent.

The original step-evaluation loop looked like this:

function getNextStep(lead, sequenceDef, state) {
  const enrollment = new Date(state.enrolledAt);
  const daysSinceEnrollment = Math.floor(
    (Date.now() - enrollment.getTime()) / 86400000
  );
  const completedStep = state.lastCompletedStep || -1;

  for (let i = completedStep + 1; i < sequenceDef.steps.length; i++) {
    if (daysSinceEnrollment >= sequenceDef.steps[i].day) {
      return i; // This step is due
    }
  }
  return null;
}

See the bug? The function returns the first step after the last completed one whose day threshold has been met. That's correct for advancing one step at a time. But when a lead was several day thresholds past their last checkpoint — because the engine hadn't run, or ran without persisting state — the function would return step completedStep + 1, and then on the next iteration of the outer processing loop it would find completedStep + 2 also eligible.

The outer loop called getNextStep, processed it, updated lastCompletedStep in memory, and then called getNextStep again for the same lead to check if another step was also due. This was intentional — it was meant to handle the case where the engine missed a day and needed to catch up by one step. But it had no upper bound. A lead enrolled 14 days ago with lastCompletedStep: 0 would get steps 1, 2, and 3 all drafted in a single run.
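A condensed, runnable sketch of that behavior (processLead and draftEmail are illustrative names, not our actual API; the checkpoint read uses ?? so a step-0 checkpoint is handled correctly):

```javascript
// The buggy step evaluator: returns the first due step after the checkpoint.
function getNextStep(lead, sequenceDef, state) {
  const daysSinceEnrollment = Math.floor(
    (Date.now() - new Date(state.enrolledAt).getTime()) / 86400000
  );
  const completedStep = state.lastCompletedStep ?? -1;
  for (let i = completedStep + 1; i < sequenceDef.steps.length; i++) {
    if (daysSinceEnrollment >= sequenceDef.steps[i].day) return i;
  }
  return null;
}

// The outer loop: keeps asking "anything else due?" with no upper bound.
function processLead(lead, sequenceDef, state, draftEmail) {
  let next = getNextStep(lead, sequenceDef, state);
  while (next !== null) {
    draftEmail(lead, sequenceDef.steps[next]);
    state.lastCompletedStep = next; // advanced in memory...
    next = getNextStep(lead, sequenceDef, state); // ...so the NEXT step now looks due too
  }
}

// A lead enrolled 14 days ago who last completed step 0:
const seq = { steps: [{ day: 0 }, { day: 2 }, { day: 5 }, { day: 10 }] };
const state = {
  enrolledAt: new Date(Date.now() - 14 * 86400000).toISOString(),
  lastCompletedStep: 0,
};
const drafts = [];
processLead({ id: 'lead-1' }, seq, state, (_l, step) => drafts.push(step.day));
console.log(drafts); // → [2, 5, 10]: three emails in one run
```

One restart, one run, three steps drafted per lead — multiply across 31 active leads and you get the 38-draft morning.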

The client nurture system had its own version of this problem. Its cold-lead detection used a simple daysSince threshold:

const coldLeads = leads.filter(l => {
  const coldStages = ['new', 'contacted', 'proposal'];
  const stage = (l.stage || 'new').toLowerCase();
  const daysCold = daysSince(l.lastContact || l.updatedAt || l.createdAt);
  const alreadySent = queue.sent.some(
    s => s.leadId === l.id && daysSince(s.sentAt) < 7
  );
  return coldStages.includes(stage) && daysCold >= COLD_LEAD_DAYS && !alreadySent;
});

The alreadySent check looked at the queue.sent array — but that array was only populated after a draft was approved and actually sent. Drafts that were queued but not yet sent didn't count. So if the previous run's drafts were still sitting in pending_approval status when the engine restarted, those leads looked eligible again.
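The blind spot is easy to reproduce in isolation. In this sketch, daysSince and the queue shape are simplified stand-ins for our actual helpers:

```javascript
// A lead whose draft is queued but unsent still passes the original filter.
const COLD_LEAD_DAYS = 3;
const daysSince = (iso) => (Date.now() - new Date(iso).getTime()) / 86400000;

const queue = {
  sent: [],                                                       // nothing has actually gone out...
  queued: [{ leadId: 'L1', queuedAt: new Date().toISOString() }], // ...but a draft is pending
};
const leads = [{
  id: 'L1',
  stage: 'contacted',
  lastContact: new Date(Date.now() - 5 * 86400000).toISOString(),
}];

const coldLeads = leads.filter((l) => {
  const coldStages = ['new', 'contacted', 'proposal'];
  const stage = (l.stage || 'new').toLowerCase();
  const daysCold = daysSince(l.lastContact || l.updatedAt || l.createdAt);
  const alreadySent = queue.sent.some(
    (s) => s.leadId === l.id && daysSince(s.sentAt) < 7
  );
  return coldStages.includes(stage) && daysCold >= COLD_LEAD_DAYS && !alreadySent;
});

console.log(coldLeads.length); // → 1: the lead gets drafted a second time
```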

The Fix

We shipped three layers of protection that afternoon.

Layer 1: Last-Sent Watermarking

Every lead's state now records not just the completed step index, but the ISO timestamp of when we last drafted any content for them. The engine refuses to draft a new step if the last draft was less than 24 hours ago, regardless of what the day math says.

function getNextStep(lead, sequenceDef, state) {
  const enrollment = new Date(state.enrolledAt);
  const daysSinceEnrollment = Math.floor(
    (Date.now() - enrollment.getTime()) / 86400000
  );
  const completedStep = state.lastCompletedStep ?? -1; // ?? not ||: a step-0 checkpoint is valid

  // Layer 1: Watermark guard — never draft twice in 24h
  if (state.lastDraftedAt) {
    const hoursSinceLastDraft =
      (Date.now() - new Date(state.lastDraftedAt).getTime()) / 3600000;
    if (hoursSinceLastDraft < 24) return null;
  }

  // Only advance ONE step per run, never skip ahead
  const nextIndex = completedStep + 1;
  if (nextIndex >= sequenceDef.steps.length) return null;

  if (daysSinceEnrollment >= sequenceDef.steps[nextIndex].day) {
    return nextIndex;
  }
  return null;
}

The critical change: we removed the for loop entirely. The function now only evaluates the single next step. No catch-up. If the engine missed day 2 and day 5 is also due, it drafts step 2 today and step 3 tomorrow. Leads experience the intended cadence gaps even if the system was down.
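Exercising the fixed function against the same 14-day-old lead shows both guards in action (the sequence shape is simplified):

```javascript
// The fixed evaluator: one step per run, plus the 24h watermark guard.
function getNextStep(lead, sequenceDef, state) {
  const daysSinceEnrollment = Math.floor(
    (Date.now() - new Date(state.enrolledAt).getTime()) / 86400000
  );
  const completedStep = state.lastCompletedStep ?? -1;

  // Watermark guard: never draft twice in 24h.
  if (state.lastDraftedAt) {
    const hoursSinceLastDraft =
      (Date.now() - new Date(state.lastDraftedAt).getTime()) / 3600000;
    if (hoursSinceLastDraft < 24) return null;
  }

  // Only advance ONE step per run, never skip ahead.
  const nextIndex = completedStep + 1;
  if (nextIndex >= sequenceDef.steps.length) return null;
  return daysSinceEnrollment >= sequenceDef.steps[nextIndex].day ? nextIndex : null;
}

const seq = { steps: [{ day: 0 }, { day: 2 }, { day: 5 }, { day: 10 }] };
const state = {
  enrolledAt: new Date(Date.now() - 14 * 86400000).toISOString(),
  lastCompletedStep: 0,
};
const first = getNextStep({}, seq, state);  // → 1: one step, not three
state.lastDraftedAt = new Date().toISOString(); // pretend we just drafted it
const second = getNextStep({}, seq, state); // → null: watermark holds
console.log(first, second);
```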

Layer 2: Cold-Start Replay Protection

We added a run-level checkpoint that records when the engine last completed a full run. On startup, if the gap between now and the last run exceeds a configurable threshold, the engine enters "reconciliation mode" — it logs what it would have done but takes no action, and emits an alert.

const RUN_CHECKPOINT_KEY = 'drip_last_successful_run';
const MAX_COLD_START_GAP_HOURS = 36;

async function checkColdStart(store) {
  const lastRun = await store.get(RUN_CHECKPOINT_KEY);
  if (!lastRun) {
    console.warn('[drip] No previous run checkpoint found — cold start detected');
    return true;
  }

  const hoursSince =
    (Date.now() - new Date(lastRun.timestamp).getTime()) / 3600000;
  if (hoursSince > MAX_COLD_START_GAP_HOURS) {
    console.warn(
      `[drip] Last run was ${hoursSince.toFixed(1)}h ago — cold start mode`
    );
    return true;
  }
  return false;
}

// In main():
const isColdStart = await checkColdStart(stateStore);
if (isColdStart && !process.argv.includes('--force')) {
  log('COLD START: Running in preview-only mode. Use --force to override.');
  process.argv.push('--preview'); // Redirect to dry-run
  await notifyOperator('Drip engine cold start detected — running preview only');
}

This means any restart after more than 36 hours triggers a safety net. The engine tells us what it wants to do, and we decide whether to --force it.
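The write side of the checkpoint isn't shown above. A minimal sketch, assuming the state store exposes async get/set (a tiny in-memory store stands in here):

```javascript
const RUN_CHECKPOINT_KEY = 'drip_last_successful_run';
const MAX_COLD_START_GAP_HOURS = 36;

// Persist the checkpoint at the end of a clean run.
async function recordSuccessfulRun(store) {
  await store.set(RUN_CHECKPOINT_KEY, { timestamp: new Date().toISOString() });
}

// Condensed version of the cold-start check from above.
async function checkColdStart(store) {
  const lastRun = await store.get(RUN_CHECKPOINT_KEY);
  if (!lastRun) return true; // no checkpoint: cold start
  const hoursSince =
    (Date.now() - new Date(lastRun.timestamp).getTime()) / 3600000;
  return hoursSince > MAX_COLD_START_GAP_HOURS;
}

// Illustrative in-memory store; production uses the shared SQLite state store.
function memoryStore() {
  const m = new Map();
  return {
    get: async (k) => m.get(k),
    set: async (k, v) => { m.set(k, v); },
  };
}

async function demo() {
  const store = memoryStore();
  const before = await checkColdStart(store); // true: no checkpoint yet
  await recordSuccessfulRun(store);
  const after = await checkColdStart(store);  // false: fresh checkpoint
  return { before, after };
}
demo().then((r) => console.log(r));
```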

Layer 3: Idempotent Delivery Gates

The client nurture system's duplicate problem came from ignoring pending drafts. We fixed this by checking both sent and queued status:

const coldLeads = leads.filter(l => {
  const coldStages = ['new', 'contacted', 'proposal'];
  const stage = (l.stage || 'new').toLowerCase();
  const daysCold = daysSince(l.lastContact || l.updatedAt || l.createdAt);

  // Check BOTH sent and queued — not just sent
  const alreadyHandled = queue.sent.some(
    s => s.leadId === l.id && daysSince(s.sentAt) < 7
  ) || queue.queued.some(
    q => q.leadId === l.id && daysSince(q.queuedAt) < 7
  );

  return coldStages.includes(stage) && daysCold >= COLD_LEAD_DAYS && !alreadyHandled;
});

We also added a global per-run cap to both systems:

const MAX_DRAFTS_PER_RUN = 5;
const MAX_FOLLOWUPS_PER_RUN = 5;

These caps existed in the nurture system already but were missing from the drip engine. Now both systems hard-stop after 5 drafts in a single execution, regardless of how many leads are eligible.
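Wiring the cap into the processing loop takes only a few lines (runDrip and draftEmail are illustrative names):

```javascript
const MAX_DRAFTS_PER_RUN = 5;

// Hard-stop after the cap, no matter how long the eligible list is.
function runDrip(eligible, draftEmail) {
  let drafted = 0;
  for (const item of eligible) {
    if (drafted >= MAX_DRAFTS_PER_RUN) {
      console.warn(
        `[drip] Per-run cap of ${MAX_DRAFTS_PER_RUN} hit — deferring ${eligible.length - drafted} drafts`
      );
      break;
    }
    draftEmail(item);
    drafted++;
  }
  return drafted;
}

// Even a 38-item backlog produces at most 5 drafts:
const drafted = [];
const count = runDrip(Array.from({ length: 38 }, (_, i) => i), (x) => drafted.push(x));
console.log(count); // → 5
```

Deferred leads are simply picked up on the next daily run, which is exactly the cadence we want anyway.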

Prevention

Beyond the code fix, we added three systemic guardrails:

1. Run checkpoint persistence. Every timer-based service now writes a lastSuccessfulRun timestamp to our shared SQLite state store on clean exit. The morning briefing system — which already synthesizes data from all 25 services — now flags any service whose last-run timestamp is more than 2× its expected interval.

2. Blast radius alerting. Our notification router now tracks rolling averages of draft/email volume per service. If any single run produces more than 3× the 7-day rolling average, it holds the queue and alerts the operator before delivery. The 38-draft run would have been caught by this gate instantly — our 7-day average was 4.2 drafts per day.

3. Preview-mode on first boot. All drip and nurture timers now run with --preview on their first execution after a systemd restart. A separate "confirm" timer runs 10 minutes later and re-invokes with --force only if the preview output looks nominal. This gives us a 10-minute window to intervene after any restart.
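Guardrails 1 and 2 both reduce to small predicates. A sketch, where the service shapes, history arrays, and thresholds are illustrative assumptions rather than our production schema:

```javascript
// Guardrail 1: flag services whose last run is more than 2x their
// expected interval old.
function staleServices(services, now = Date.now()) {
  return services
    .filter((s) =>
      now - new Date(s.lastSuccessfulRun).getTime() >
      2 * s.expectedIntervalHours * 3600000)
    .map((s) => s.name);
}

// Guardrail 2: hold the queue when a run exceeds 3x the rolling average.
function shouldHoldQueue(history, todayCount, multiplier = 3) {
  if (history.length === 0) return false; // no baseline yet, let it through
  const avg = history.reduce((a, b) => a + b, 0) / history.length;
  return todayCount > multiplier * avg;
}

const services = [
  { name: 'drip-sequences', expectedIntervalHours: 24,
    lastSuccessfulRun: new Date(Date.now() - 20 * 3600000).toISOString() },
  { name: 'client-nurture', expectedIntervalHours: 24,
    lastSuccessfulRun: new Date(Date.now() - 60 * 3600000).toISOString() },
];
console.log(staleServices(services));        // → ['client-nurture'] (60h > 48h)

const last7Days = [4, 5, 3, 4, 5, 4, 4.4];   // ≈ 4.2 drafts/day average
console.log(shouldHoldQueue(last7Days, 38)); // → true: 38 > 12.6, hold and alert
console.log(shouldHoldQueue(last7Days, 5));  // → false: normal volume
```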

Lessons for Your Team

Timer-based automation is a state machine, not a cron job. If your system calculates "what's due" from wall-clock time every run, you've built a system that can replay its entire history after a restart. The fix is to track what you've done, not just what's due.

"Catch-up" logic is a loaded gun. We intentionally built the multi-step catch-up loop because we thought missed days were the bigger risk. They weren't. A missed day costs you one delayed email. A replay costs you a prospect relationship. Design for the worse failure mode.

Pending work is work-in-progress, not work-not-started. The nurture system's duplicate bug came from only checking completed work when filtering eligible leads. Any system with a queue needs to treat "queued but not yet executed" as a reservation, not a ghost.

Cold-start protection should be the default for any stateful timer. If your service reads state from disk and makes decisions based on elapsed time, it needs to know whether it's waking up from a 1-hour nap or a 3-day coma. Those are different situations requiring different behaviors.

Volume caps are free insurance. A hard cap of 5 drafts per run costs you nothing in normal operation — our daily volume never exceeded 5 anyway. But it converts a catastrophic blast into a bounded incident. Every timer-driven system that produces external side effects should have an absolute per-run ceiling.

Conclusion

The February 18th incident didn't cause lasting damage — we caught it fast enough, and the emails that did go out were well-written drafts, not spam. But it exposed a class of bug that's invisible in normal operation and catastrophic on restart. The fix wasn't complex. Three guards — watermarking, cold-start detection, and idempotent delivery gates — took about 90 minutes to implement and have prevented two near-misses since (both during planned maintenance windows).

If you're running drip sequences, nurture campaigns, or any timer-driven automation that produces external actions, ask yourself: what happens if this service restarts after being down for 72 hours? If the answer is "it replays everything," you have a bug — you just haven't triggered it yet.

Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.

By Ledd Consulting