The Replay Lab Pattern: How We Train Scoring Models by Re-Running Historical Decisions

Every scoring pipeline starts optimistic. You hand-tune some weights, eyeball a few outputs, ship it. Three weeks later someone asks, "Is the new version actually better?" and you realize you have no way to answer that question without waiting another three weeks for fresh data.

At Ledd Consulting, we run four separate scoring pipelines — consulting leads, opportunity ranking, proposal review, and product build prioritization. Each one scores incoming records against a weighted config and decides what deserves attention. We needed a way to improve those configs continuously without splitting live traffic or waiting for slow feedback cycles. So we built what we call Replay Labs: offline evaluation loops that re-run every historical decision against candidate configurations, measure the outcome, and promote winners automatically.

This post walks through the pattern, the real code, and the production metrics that came out of it.

The Pattern — One-Sentence Definition

A Replay Lab loads labeled historical records, scores each one against both the current config and hundreds of mutated candidates using k-fold cross-validation, then promotes the candidate that best predicts known outcomes.

It is A/B testing without traffic splitting. You replay the past instead of gambling with the present.

The Naive Approach (and Why It Fails)

Most teams try one of three things when they want to improve a scoring pipeline:

Manual tuning. A developer looks at a few misranked records, nudges a weight, deploys, and waits. This fails because humans are terrible at multi-dimensional optimization. You fix the false negative that annoyed your PM and introduce three new false positives you won't notice for a week.

Online A/B testing. You split live traffic between the old config and a candidate. This fails for low-volume pipelines. If you're scoring 200 leads a month, you'd need months of traffic to reach statistical significance — and you're sending half your leads through a config you have no confidence in.

Backtesting on a holdout set. Better, but most teams evaluate on the full dataset without cross-validation, which means the winning config is overfit to the exact records it was tuned on.

The Replay Lab pattern solves all three: it runs offline, uses k-fold validation to prevent overfitting, and evaluates hundreds of candidates in seconds.

Pattern Implementation

Step 1: Outcome Labeling

The entire pattern depends on one thing: labeled outcomes. Every record in our pipeline eventually gets an outcome tag — ignored, replied, meeting, proposal, won, lost. We assign a numeric rank to each:

const STRONG_RANK = 7;

// Outcome ranks: ignored=1, bounced=2, replied=4, meeting=7, proposal=8, won=10.
// STRONG_RANK is the threshold separating "worth pursuing" from noise.
const OUTCOME_RANKS = {
  ignored: 1, bounced: 2, replied: 4, meeting: 7, proposal: 8, won: 10,
};

function getOutcomeRank(entry) {
  // Outcomes outside the map default to 0 (weaker than any labeled outcome).
  return OUTCOME_RANKS[entry.outcome] || 0;
}

Without labels, there's nothing to replay against. We started labeling manually, then automated it: when a CRM record moves to meeting or proposal stage, an event fires and the label is written back to the dataset. The key insight is that you don't need perfect labels — you need a directional signal. Even noisy labels converge over enough iterations.

Step 2: The Evaluation Function

The core of the pattern is a function that takes a dataset and a scoring config, runs k-fold cross-validation, and returns an objective score:

const FOLD_COUNT = 5;

function evaluateConfig(entries, config) {
  return evaluateRankingConfig(entries, {
    buildSignal: training => buildHistoricalSignal(training, config),
    foldCount: FOLD_COUNT,
    getKey: entry =>
      entry.externalId || entry.key || entry.company
      || entry.title || entry.timestamp,
    getRank: getOutcomeRank,
    scoreRecord: (entry, rankSignal) =>
      scoreLead(entry, { config, rankSignal }),
    strongRank: STRONG_RANK,
  });
}

The k-fold split is critical. For each fold, the historical signal (how similar past records performed) is built only from the training partition. The held-out partition is scored blind. This prevents the config from memorizing specific records.

buildHistoricalSignal deserves a callout: it takes the training fold and builds a lookup of how records with similar characteristics performed. This signal is then passed into scoreLead alongside the config weights. It's a lightweight form of collaborative filtering — "leads like this one historically converted at X rate."
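The `evaluateRankingConfig` internals aren't shown in this post. To make the k-fold mechanics concrete, here is a minimal sketch of what such an evaluator could look like; the partitioning scheme and the objective (average ranked position of strong outcomes, negated so higher is better) are illustrative assumptions, not the production implementation:

```javascript
// Illustrative k-fold ranking evaluation; not the production evaluateRankingConfig.
function evaluateRankingConfigSketch(entries, opts) {
  const { buildSignal, foldCount, getRank, scoreRecord, strongRank } = opts;
  let strongPositionSum = 0;
  let strongCount = 0;

  for (let fold = 0; fold < foldCount; fold += 1) {
    // Partition by index: every foldCount-th record is held out.
    const holdout = entries.filter((_, i) => i % foldCount === fold);
    const training = entries.filter((_, i) => i % foldCount !== fold);

    // The historical signal is built ONLY from the training partition.
    const rankSignal = buildSignal(training);

    // Score the held-out records blind, then sort best-first.
    const ranked = holdout
      .map(entry => ({ entry, score: scoreRecord(entry, rankSignal) }))
      .sort((a, b) => b.score - a.score);

    // Objective ingredient: where do known-strong outcomes land in the
    // ranking? Lower positions are better, so we negate the average below.
    ranked.forEach(({ entry }, position) => {
      if (getRank(entry) >= strongRank) {
        strongPositionSum += position;
        strongCount += 1;
      }
    });
  }

  const objective = strongCount === 0 ? 0 : -(strongPositionSum / strongCount);
  return { objective };
}
```

A config that consistently ranks strong outcomes near the top of each held-out fold scores a higher objective than one that buries them.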

Step 3: Candidate Search

We don't use gradient descent. Our configs are small enough (a handful of weights plus a few threshold values) that random search over 300 iterations consistently finds strong candidates:

const SEARCH_ITERATIONS = 300;

function mutateConfig(baseConfig, iteration) {
  const candidate = cloneConfig(baseConfig);
  const { next, random } = mutateWeightMap(
    baseConfig.weights, iteration
  );
  candidate.weights = next;
  candidate.highPriorityScore = roundValue(
    6 + random() * 2, 2
  );
  candidate.thresholds.followUpDays = Math.max(
    2, Math.round(2 + random() * 4)
  );
  candidate.thresholds.maxHistoricalTokensPerLead = Math.max(
    6, Math.round(8 + random() * 6)
  );
  return candidate;
}

function searchBestConfig(entries, baseConfig) {
  let bestConfig = cloneConfig(baseConfig);
  let bestResult = evaluateConfig(entries, bestConfig);

  for (let i = 0; i < SEARCH_ITERATIONS; i += 1) {
    const candidate = mutateConfig(baseConfig, i + 1);
    const result = evaluateConfig(entries, candidate);
    if (result.objective > bestResult.objective) {
      bestConfig = candidate;
      bestResult = result;
    }
  }

  return { bestConfig, bestResult };
}

A few design decisions worth noting:

  • Mutation is seeded from the iteration index, not purely random. mutateWeightMap uses the iteration number to control exploration breadth — early iterations explore widely, later ones fine-tune near the current best. This gives us simulated annealing behavior without the complexity.
  • We mutate from the base config every time, not from the current best. This prevents the search from getting trapped in a local optimum early. Every candidate is one mutation away from the production config, not from an intermediate candidate.
  • Thresholds are mutated alongside weights. highPriorityScore, followUpDays, and maxHistoricalTokensPerLead are all part of the search space. Teams often freeze thresholds and only tune weights — we found co-optimizing them yields materially better results.
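The production `mutateWeightMap` isn't shown here. A minimal version of the iteration-scaled exploration it performs might look like this; the PRNG choice, breadth schedule, and clamping are assumptions for illustration:

```javascript
// Illustrative sketch of iteration-scaled weight mutation; the production
// mutateWeightMap and its seeding scheme may differ.
function makeSeededRandom(seed) {
  // Tiny deterministic PRNG (mulberry32) so a given iteration is reproducible.
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function mutateWeightMapSketch(weights, iteration, totalIterations = 300) {
  const random = makeSeededRandom(iteration);
  // Exploration breadth shrinks as iterations progress: wide early, narrow late.
  const breadth = 1 - iteration / (totalIterations + 1);
  const next = {};
  for (const [key, value] of Object.entries(weights)) {
    const nudge = (random() * 2 - 1) * breadth * value;
    // Clamp to non-negative so degenerate configs are avoided.
    next[key] = Math.max(0, Number((value + nudge).toFixed(3)));
  }
  return { next, random };
}
```

The returned `{ next, random }` shape matches how `mutateConfig` above destructures the result: the same seeded `random` is reused to mutate the thresholds.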

Step 4: Baseline Comparison and Reporting

Every replay run evaluates the current production config as a baseline, then reports the delta:

const baseline = evaluateConfig(entries, baseConfig);
const { bestConfig, bestResult } = searchBestConfig(entries, baseConfig);

const report = {
  baseline: {
    config: baseConfig,
    metrics: baseline.metrics,
    objective: baseline.objective,
  },
  best: {
    config: bestConfig,
    metrics: bestResult.metrics,
    objective: bestResult.objective,
  },
  dataset: {
    count: entries.length,
    outcomeCounts: entries.reduce((counts, entry) => {
      counts[entry.outcome] = (counts[entry.outcome] || 0) + 1;
      return counts;
    }, {}),
    strongCount: entries.filter(
      entry => getOutcomeRank(entry) >= STRONG_RANK
    ).length,
  },
  foldCount: FOLD_COUNT,
  generatedAt: new Date().toISOString(),
  searchIterations: SEARCH_ITERATIONS,
};

The report captures everything needed for a promotion decision: baseline objective, best objective, the full candidate config, dataset size, and outcome distribution. We write this to both JSON (for programmatic consumption) and Markdown (for the morning briefing digest).

Step 5: Config Promotion

The --write-config flag auto-promotes the winning config back to the production config file:

if (writeConfig) {
  fs.writeFileSync(
    CONFIG_FILE, JSON.stringify(bestConfig, null, 2)
  );
}

We don't always auto-promote. Our scheduled replay runs log the report; a separate autopromote check verifies that the objective improvement exceeds a minimum threshold and that the dataset has enough labeled records before writing the config. This prevents a lucky mutation on a thin dataset from overwriting a stable config.
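The autopromote check described above reduces to a small guard. This sketch uses the 20-record minimum mentioned later in this post; the exact improvement threshold is an assumption for illustration:

```javascript
// Illustrative autopromote guard. MIN_RECORDS matches the 20-record floor
// described in this post; MIN_IMPROVEMENT is an assumed example value.
const MIN_RECORDS = 20;
const MIN_IMPROVEMENT = 0.02; // minimum objective gain before promotion

function shouldPromote(report) {
  const { baseline, best, dataset } = report;
  // Thin label set: the lab still reports, but never writes the config.
  if (dataset.count < MIN_RECORDS) return false;
  // Require a real improvement, not a lucky mutation at the margin.
  return best.objective - baseline.objective >= MIN_IMPROVEMENT;
}
```

The guard consumes the same report object shown in Step 4, so the scheduled run and the autopromote path share one artifact.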

Step 6: The Scorecard — Tracking Drift Across Domains

A single replay lab covers one scoring pipeline. The scorecard aggregates all four into one view and tracks them over time:

const LAB_FILES = Object.freeze({
  consulting: path.join(ANALYTICS_DIR, 'consulting-replay-lab.json'),
  opportunity: path.join(ANALYTICS_DIR, 'opportunity-replay-lab.json'),
  product_build: path.join(ANALYTICS_DIR, 'product-build-replay-lab.json'),
  proposal_review: path.join(ANALYTICS_DIR, 'proposal-review-replay-lab.json'),
});

The scorecard compares 7-day windows — current versus previous — and surfaces which pipelines improved, which regressed, and which have stale labels. It pulls events from our event bus to count how many scoring events each pipeline processed and cross-references with the replay lab outputs.

This is where the feedback loop closes. If the consulting lead scorer's objective drops week-over-week, the scorecard flags it. A developer (or an automated agent) kicks off a fresh replay lab run, finds a better config, and promotes it. The next scorecard reflects the improvement.
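The window comparison itself is simple. As a sketch, assuming each replay report carries the `generatedAt` and `objective` fields shown in Step 4 (the averaging and status labels here are illustrative):

```javascript
// Illustrative week-over-week comparison for one pipeline's replay history.
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function weekOverWeekDelta(history, now = Date.now()) {
  const inWindow = (entry, start, end) => {
    const t = new Date(entry.generatedAt).getTime();
    return t >= start && t < end;
  };
  const average = entries =>
    entries.length === 0
      ? null
      : entries.reduce((sum, e) => sum + e.objective, 0) / entries.length;

  // Current window: last 7 days. Previous window: the 7 days before that.
  const current = average(history.filter(e => inWindow(e, now - WEEK_MS, now)));
  const previous = average(
    history.filter(e => inWindow(e, now - 2 * WEEK_MS, now - WEEK_MS))
  );

  // No reports in either window means the pipeline's labels are stale.
  if (current === null || previous === null) return { status: 'stale' };
  const delta = current - previous;
  return { status: delta < 0 ? 'regressed' : 'improved', current, previous, delta };
}
```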

In Production

We run four replay labs on a nightly schedule. Real numbers from our system:

  • 4 scoring domains share the same replay lab pattern — consulting leads, opportunity ranking, proposal review, and product build prioritization
  • 300 search iterations per domain, completing in under 30 seconds per run on a single VPS
  • 5-fold cross-validation on every evaluation, preventing overfitting on small datasets
  • 7-day comparison windows in the scorecard catch regressions within one weekly cycle

Edge cases we hit:

Thin label sets. Early on, we had 12 labeled records for product build scoring. The 5-fold split left folds with 2-3 records each — statistically meaningless. We added a minimum record count (currently 20) before the autopromote path activates. Below that, the lab still runs and reports, but won't write the config.

Outcome label lag. A lead scored today might not get an outcome label for weeks. The replay dataset is always slightly stale. We handle this by only including records with confirmed outcomes — no imputation, no guessing. The dataset grows slowly but stays clean.

Degenerate mutations. Some random configs score everything as high-priority or everything as low. The k-fold objective function naturally penalizes these because they can't separate strong outcomes from weak ones, but early iterations spent cycles on clearly bad candidates. We added boundary constraints on mutateWeightMap to keep weights in plausible ranges.

Variations

For classification tasks: Replace the ranking objective with precision/recall at your decision threshold. The structure is identical — load labeled data, mutate config, evaluate, promote.

For LLM-based scoring: If your scorer calls an LLM, replay gets expensive. We cache LLM outputs by input hash so replays only re-run the weight/threshold math, not the LLM call. The prompt stays fixed; only the post-processing config is searched.

For multi-stage pipelines: Run a replay lab per stage. Our proposal review pipeline has two stages — initial filter and detailed scoring. Each stage has its own config, its own replay lab, and its own objective. The scorecard tracks both.

For teams with more compute: Replace random search with Bayesian optimization (e.g., Optuna). We stuck with random search because 300 iterations over a small config space is sufficient and introduces zero dependencies. If your config has 50+ parameters, you'll want something smarter.

Conclusion

The Replay Lab pattern gives you three things most scoring pipelines lack: a rigorous evaluation methodology that doesn't require live traffic, a search process that finds better configs automatically, and a scorecard that tracks whether your pipelines are actually improving over time.

The most important part isn't the search algorithm — it's the discipline of labeling outcomes and closing the loop. Once you have labeled data flowing back, the replay lab turns improving your scoring pipeline from a guessing game into a mechanical process: label, replay, compare, promote, repeat.

We've run this pattern across four domains for months. Every week, the scorecard tells us exactly where we stand — no dashboards to interpret, no A/B tests to babysit, no arguments about whether the new weights are "better."

Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.


By Ledd Consulting