Why We Route 44 Task Types Across 3 Model Tiers Instead of Always Using the Best Model
Context — The Decision That Shapes Every API Call
When you're running 25+ microservices, a seven-agent research pipeline, and 60+ scheduled timers — all making LLM calls — model selection stops being a one-time configuration choice. It becomes an architectural decision that compounds across every single inference call your system makes.
At Ledd Consulting, we hit this inflection point around service number fifteen. Our daily pipeline kicks off at 1:00 AM EST: five to seven research agents fan out across the web, an action extractor compresses reports at 2:00 AM, a cross-agent synthesizer builds a daily brief at 2:15 AM, a knowledge tracker updates the cumulative base at 2:30 AM, and a builder kicks off autonomous work at 3:00 AM. By morning, a job digest pipeline has scored and deduplicated listings at 6:00 AM, a briefing email has landed in the inbox at 7:00 AM, and a playbook generates proposals and outreach at 9:00 AM.
Every one of those steps calls an LLM. The question wasn't whether to use AI — it was which model for which task, and the difference between getting that right and getting it wrong was the difference between a system that scales and one that bleeds money or crawls.
This is the architecture decision record for how we solved it.
Options Considered
Option 1: Single Model for Everything (Opus/GPT-4 Class)
The simplest approach. Pick the most capable model and use it everywhere. No routing logic, no classification, no maintenance overhead.
Pros: Maximum quality on every task. Zero routing complexity. No risk of under-specifying a task to a weak model.
Cons: Brutal on cost and latency. When your system health-check agent runs every 10 minutes and your CRM updater fires on every pipeline event, you're burning top-tier tokens on tasks that a much smaller model handles identically. Our overnight pipeline alone would generate hundreds of Opus-tier calls for tasks like "extract structured data from this job listing" or "classify this lead as warm/cold." Latency matters too — Opus-class models take 3-5x longer on simple tasks where speed is the real constraint.
Option 2: Single Model for Everything (Haiku/GPT-4o-mini Class)
The opposite extreme. Use the fastest, cheapest model and accept the quality trade-offs.
Pros: Lowest cost. Fastest responses. Simplest architecture.
Cons: Quality collapses on tasks that actually need reasoning. We tested this early. Haiku is excellent at classification, data extraction, and reformatting — but ask it to draft a nuanced cold outreach email that needs to convert, or synthesize five competing market intelligence reports into a coherent strategic brief, and the output is noticeably worse. Our blog content and strategic planning outputs were unusable at this tier. The whole point of running autonomous agents is that their output quality has to survive without human editing on the critical path.
Option 3: Dynamic Routing Based on Task Type (What We Chose)
Maintain a routing table that maps every task type in the system to the appropriate model tier. Simple lookup, no ML-based routing, no runtime complexity.
Pros: Right-sized model for every task. Measurably better cost profile. Explicit, auditable routing decisions. Easy to override for specific tasks as models improve.
Cons: Requires upfront classification of every task type. Routing table needs maintenance as new tasks are added. You're making judgment calls about where the quality threshold falls — and sometimes you're wrong.
Option 4: Adaptive Routing with Quality Scoring
Route to the cheapest model first, evaluate output quality with a scoring function, and retry on a higher tier if it falls below threshold.
Pros: Theoretically optimal — you only pay for quality when you need it.
Cons: In practice, this doubles your latency on every task that needs an upgrade (which in our testing was about 40% of Sonnet-tier tasks). The quality scoring function itself requires LLM calls or complex heuristics. And retry-based architectures are miserable to debug in production — when your 2:00 AM action extractor retries three times and the 2:15 AM synthesizer is waiting on its output, you've built a cascading delay into your pipeline. We prototyped this and killed it within a week.
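In sketch form, the rejected design looked roughly like this (callModel and scoreOutput are stand-ins for whatever client and scorer you'd plug in, not code from our stack):

```javascript
// Sketch of the rejected Option 4: cheapest tier first, quality-gated retry.
// callModel and scoreOutput are caller-supplied stand-ins, not real APIs.
async function adaptiveCall(prompt, tiers, callModel, scoreOutput, threshold) {
  let output;
  for (const model of tiers) {
    output = await callModel(model, prompt);
    if (scoreOutput(output) >= threshold) return { model, output };
    // Below threshold: the full latency of this call was wasted, and the
    // next tier up starts from zero. Downstream jobs keep waiting.
  }
  return { model: tiers[tiers.length - 1], output }; // best effort
}
```

The latency problem is visible in the loop: every escalation pays the cheap call's full wall-clock time before the expensive call even starts.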
Decision Criteria
We evaluated against four criteria, in order of priority:
1. Output quality on high-stakes tasks must be indistinguishable from always-Opus. Our blog posts, strategic playbooks, and client-facing communications are revenue-generating artifacts. Degraded quality here has a direct business cost that dwarfs any inference savings.
2. Latency on high-frequency tasks must stay under 5 seconds. Health checks, CRM updates, classifications, and data extractions run constantly. If the model routing adds latency to these, it undermines the responsiveness of the entire agent fleet.
3. The routing logic must be auditable and overridable in under 30 seconds. When we discover a task is misrouted — and we have, multiple times — we need to fix it by changing one line, not retraining a classifier.
4. Zero additional infrastructure. We already run 30 services on a single VPS. The routing solution cannot require another service, another database, or another point of failure.
Our Decision — Static Routing Table with Three Tiers
We built a single-file model router that every service in the system imports. Here's the actual production code:
const ROUTES = {
  // --- Tier 1: Haiku (fast, structured, simple tasks) ---
  'exploration': 'haiku',
  'content-reformat': 'haiku',
  'data-extraction': 'haiku',
  'classification': 'haiku',
  'health-check': 'haiku',
  'social-post': 'haiku',
  'email-draft': 'haiku',
  'daily-plan': 'haiku',
  'daily-reflection': 'haiku',
  'prospect-search': 'haiku',
  'crm-update': 'haiku',
  // --- Tier 2: Sonnet (balanced reasoning, most tasks) ---
  'synthesis': 'sonnet',
  'proposal-draft': 'sonnet',
  'bid-review': 'sonnet',
  'cold-outreach': 'sonnet',
  'meeting-prep': 'sonnet',
  'analysis': 'sonnet',
  'action-extract': 'sonnet',
  'system-recs': 'sonnet',
  'knowledge-track': 'sonnet',
  'client-comms': 'sonnet',
  'follow-up': 'sonnet',
  'contract': 'sonnet',
  // --- Tier 3: Opus (deep reasoning, long-form, high-stakes) ---
  'blog-post': 'opus',
  'strategy': 'opus',
  'book-writing': 'opus',
  'complex-research': 'opus',
};

const DEFAULT_MODEL = 'sonnet';

function routeModel(taskType) {
  return ROUTES[taskType] || DEFAULT_MODEL;
}
The distribution tells the story: 11 task types on Haiku, 12 on Sonnet, 4 on Opus. The bulk of our system's daily inference — health checks, CRM updates, classification, data extraction, social posting — runs on the fastest, cheapest tier. The middle tier handles tasks that need real reasoning but aren't long-form creative work. Opus is reserved exclusively for high-stakes outputs: blog content, strategic planning, book drafting, and deep multi-source research.
How Services Consume It
Every service in the fleet imports the router identically:
const { routeModel, getTier } = require('./model-router');
// In the overnight research pipeline
const explorationModel = routeModel('exploration'); // → 'haiku'
const synthesisModel = routeModel('synthesis'); // → 'sonnet'
// In the autonomous playbook (9 AM daily)
const outreachModel = routeModel('cold-outreach'); // → 'sonnet'
const blogModel = routeModel('blog-post'); // → 'opus'
The getTier helper is used by our agent analytics service to track model distribution across the fleet:
function getTier(taskType) {
  const model = routeModel(taskType);
  if (model === 'haiku') return 1;
  if (model === 'sonnet') return 2;
  return 3;
}
This feeds into our observability dashboard, which shows real-time tier distribution across all 53 agents (23 on the VPS, 23 timer-triggered, and 7 cloud-deployed). On a typical day, roughly 65% of inference calls hit Tier 1, 28% hit Tier 2, and 7% hit Tier 3.
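The aggregation behind those percentages is simple; a minimal sketch, assuming each fleet log entry records the model that handled the call (the entry shape here is an assumption, not our production schema):

```javascript
// Illustrative aggregation behind the dashboard's tier percentages.
const TIER = { haiku: 1, sonnet: 2, opus: 3 };

function tierDistribution(callLog) {
  // callLog: array of { model } entries emitted by the fleet
  const counts = { 1: 0, 2: 0, 3: 0 };
  for (const { model } of callLog) counts[TIER[model]] += 1;
  const total = callLog.length || 1; // avoid divide-by-zero on an empty log
  return {
    tier1: counts[1] / total,
    tier2: counts[2] / total,
    tier3: counts[3] / total,
  };
}
```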
The Default Matters More Than You Think
const DEFAULT_MODEL = 'sonnet';
One line, but it encodes a critical philosophy: when in doubt, over-provision. If a new service makes an LLM call with an unregistered task type, it falls through to Sonnet — not Haiku. We'd rather pay slightly more on an unknown task than ship degraded output. Every time the default fires, it also shows up in our logs as an unrouted task, which tells us we need to add a new entry to the routing table.
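A minimal sketch of that fall-through behavior (the log message format is illustrative; the table is excerpted):

```javascript
const ROUTES = { 'classification': 'haiku', 'synthesis': 'sonnet' }; // excerpt
const DEFAULT_MODEL = 'sonnet';

function routeModel(taskType) {
  const model = ROUTES[taskType];
  if (!model) {
    // Unrouted task: leave a trace so the table gets a new entry,
    // then over-provision rather than ship degraded output.
    console.warn(`[model-router] unrouted task type "${taskType}", using ${DEFAULT_MODEL}`);
    return DEFAULT_MODEL;
  }
  return model;
}
```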
Why Not Use a Classifier?
We considered building an ML classifier that would analyze the prompt content and dynamically select a tier. We rejected it for three reasons:
- It's another model call. Routing a task to the right model by first calling a model to classify the task is architecturally absurd when you already know the task type at call time.
- It's non-deterministic. The same task would sometimes route differently, making debugging impossible. With a static table, routeModel('synthesis') returns 'sonnet' every single time. No surprises at 2:00 AM.
- The taxonomy already exists. Our services are purpose-built. The job scoring service knows it's doing classification, and the synthesizer knows it's doing synthesis. The task type is a property of the caller, not the content — and the caller already has it.
Consequences — What Actually Happened
What Worked
The overnight pipeline is significantly faster. Our five daily research agents each run an exploration phase (Haiku) and a synthesis phase (Sonnet). Exploration on Haiku completes in roughly a third of the time Sonnet would take for the same structured-output task. Since we run agents in parallel batches of four, the per-batch wall-clock time dropped meaningfully when we stopped over-provisioning the exploration phase.
Quality is indistinguishable on tier-appropriate tasks. We ran a blind evaluation on 50 classification outputs (Haiku vs. Sonnet) and 30 data-extraction outputs (Haiku vs. Sonnet). Human evaluators couldn't reliably distinguish them. On synthesis and analysis tasks, however, Sonnet was consistently preferred over Haiku — confirming the tier boundary was in the right place.
Routing changes are trivially fast. When we added the contract task type for SOW generation, it was a single line added to the Sonnet tier. When we discovered that cold-outreach emails needed more nuance than Haiku could deliver (they were converting poorly), we moved it from Tier 1 to Tier 2 by changing one string. Total time from "this output quality is wrong" to "fix is deployed": under two minutes.
What We'd Do Differently
We should have started with metrics on day one. For the first few months, we tracked which models were called but not whether the routing was correct. We've since added quality sampling — a weekly cron that pulls random outputs from each tier and flags any that seem misrouted. We should have built that from the start.
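The sampler itself is small; an illustrative sketch, assuming each logged output records its tier (the entry shape and function name are assumptions):

```javascript
// Illustrative weekly quality sampler: pull a few random outputs per
// tier for human review.
function sampleForReview(outputLog, perTier = 5) {
  const byTier = new Map();
  for (const entry of outputLog) {
    if (!byTier.has(entry.tier)) byTier.set(entry.tier, []);
    byTier.get(entry.tier).push(entry);
  }
  const sample = {};
  for (const [tier, entries] of byTier) {
    // Shuffle-and-slice is statistically sloppy but fine for a spot check.
    sample[tier] = [...entries].sort(() => Math.random() - 0.5).slice(0, perTier);
  }
  return sample;
}
```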
Some task types straddle tiers. email-draft is on Haiku because most automated emails are simple. But occasionally a service generates an email that really needs Sonnet-level reasoning. We handle this with an optional override parameter — routeModel('email-draft', { override: 'sonnet' }) — but it's bolted on. A cleaner design would have supported per-invocation hints from the beginning.
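A minimal sketch of the bolted-on override, using the options-object shape shown above (the rest of the table is excerpted):

```javascript
const ROUTES = { 'email-draft': 'haiku' }; // excerpt
const DEFAULT_MODEL = 'sonnet';

// The override is checked before the table lookup, so a per-invocation
// hint always wins over the static route.
function routeModel(taskType, opts = {}) {
  if (opts.override) return opts.override;
  return ROUTES[taskType] || DEFAULT_MODEL;
}
```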
Four Opus task types feels too few. We're probably under-utilizing Opus for some analysis and research tasks where the quality difference would matter. The instinct to save on cost can overcorrect toward cheaper models, and the quality gap between Sonnet and Opus on complex reasoning tasks is real.
When to Reconsider
This architecture has a shelf life. Here's when we'd redesign:
When model pricing changes dramatically. If Opus-class models drop to Sonnet pricing, the three-tier system collapses to two. If Haiku gets substantially more capable, the Sonnet tier shrinks. Every major model release, we re-evaluate the tier boundaries.
When we exceed ~100 task types. At 44 task types, a flat routing table is easy to scan. At 100+, we'd want a hierarchical system — route by domain first (content, operations, research, client), then by task within domain.
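One possible shape for that domain-first routing (the four domain names are from the text; the table contents and the _default key are illustrative):

```javascript
// Hypothetical hierarchical router: route by domain first, then by task.
const DOMAIN_ROUTES = {
  content:    { 'blog-post': 'opus', _default: 'sonnet' },
  operations: { 'health-check': 'haiku', _default: 'haiku' },
  research:   { 'complex-research': 'opus', _default: 'sonnet' },
  client:     { 'cold-outreach': 'sonnet', _default: 'sonnet' },
};

function routeByDomain(domain, taskType) {
  const table = DOMAIN_ROUTES[domain];
  if (!table) return 'sonnet'; // same over-provisioned default as the flat router
  return table[taskType] || table._default;
}
```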
When task types become dynamic. Right now, every task type is known at development time. If we build a system where users define custom agent workflows with arbitrary task types, static routing breaks. That's when you need the classifier approach we rejected — but even then, we'd likely use a rule-based classifier with explicit tier boundaries, not a probabilistic one.
When quality scoring becomes cheap and fast. If someone ships a reliable, low-latency quality evaluator that can score an LLM output in under 200ms without an LLM call, the adaptive retry approach becomes viable again.
Conclusion
Model routing isn't glamorous infrastructure. It's a lookup table in a single JavaScript file. But that table encodes 44 decisions about where quality matters and where speed matters, and those decisions compound across hundreds of daily inference calls. The architecture is deliberately boring — a static map, a safe default, a three-tier hierarchy — because boring infrastructure is infrastructure that works at 2:00 AM when nobody's watching.
The pattern generalizes beyond our specific stack: know your task taxonomy, measure quality at each tier, and resist the temptation to throw your best model at everything. The best model for the job is the cheapest one that produces indistinguishable output.
Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.