Engineering Notes

Designing a Multi-Model LLM Harness

The frontier moves every few weeks, and no single model wins every task. The practical answer is not to pick one but to build a thin layer that can call any of them safely, route work to the right one, and keep improving from its own data. What follows is the architecture pattern, kept model-agnostic.

The mental model

You make one call with either a model or a job type. The harness picks a model, runs it through a fixed set of rails, logs everything, and falls back to the next candidate if one fails. Application code never calls a provider directly — there is one door.

ask("summarize this", task="cheap_bulk")     # route by job + fallback
ask("review my plan", model="planner")        # explicit model, still governed
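A minimal sketch of what that one door might look like, assuming a hypothetical `ROUTES` table, placeholder model names, and an injected `call` adapter (none of these come from a real library). The rails are left out here so the routing and fallback logic stays visible:

```python
from dataclasses import dataclass

# Hypothetical routing table: job type -> ordered candidate models.
ROUTES = {
    "cheap_bulk": ["small_local", "small_hosted"],
    "hard_reasoning": ["planner", "generalist"],
}


@dataclass
class Result:
    text: str
    model: str
    degraded: bool  # True when an earlier candidate was passed over


def ask(prompt, *, task=None, model=None, call):
    """Single entry point. `call(model, prompt)` returns text or raises."""
    if (task is None) == (model is None):
        raise ValueError("pass exactly one of task= or model=")
    # An explicit model is a one-item chain, so it goes through the same loop.
    candidates = [model] if model is not None else ROUTES[task]
    errors = []
    for name in candidates:
        try:
            text = call(name, prompt)
        except Exception as exc:  # sketch only; catch narrower errors in practice
            errors.append((name, repr(exc)))
            continue
        return Result(text, name, degraded=bool(errors))
    raise RuntimeError(f"all candidates failed: {errors}")
```

Passing `call` in explicitly is only a convenience for testing; in the two-line usage above, the provider adapter would be bound inside the harness. Whether a pinned model should also get a fallback chain is a design choice this sketch does not settle: here it simply fails if its one candidate fails.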

The rails — what every call passes through

In order. Each one exists to prevent a specific failure that bites at scale.

1. Cache
Seen this exact prompt before? Return the stored answer: it costs nothing, and retries become idempotent.
2. Budget
Is this provider over its daily limit? Metered models get a dollar cap; subscription models get an estimated-token cap. If it is over budget, skip to the next candidate.
3. Circuit breaker
Has this provider failed several times recently? Skip it fast for a cooldown period instead of paying the full timeout on every call. Otherwise one outage cascades across the whole fleet.
4. Concurrency slot
Too many calls in flight at once? Wait for a slot. A cross-process limiter stops overlapping jobs from spawning a subprocess storm.
5. Call
Run the model. On success: reset the breaker, store the cache entry, log cost and latency. On failure: count it against the breaker, which trips after several recent failures, and fall back to the next model.
6. Verify
Optionally check the output (schema valid? test passes? a second model agrees?) and record a quality verdict, turning "it ran" into "it was correct."
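The circuit-breaker rail is the most stateful of the six, so here is one way it could be sketched. The class name, the default threshold of 3, and the 60-second cooldown are illustrative assumptions, not values from any particular harness:

```python
import time


class CircuitBreaker:
    """Skip a provider for `cooldown` seconds after `threshold` straight failures."""

    def __init__(self, threshold=3, cooldown=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock      # injectable so tests do not need to sleep
        self.failures = 0
        self.opened_at = None   # None means the breaker is closed

    def allow(self):
        """Return True if a call to this provider may proceed right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown over: let a probe call through. The failure count is
            # kept, so a single failed probe re-opens the breaker at once.
            self.opened_at = None
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

The harness would call `allow()` before the Call step and `record_success()` or `record_failure()` after it. This version lives in one process; sharing breaker state across processes would need a file or a small database, which is outside the sketch.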

Six principles

One door, not ten
Every model — local or hosted, subscription or metered — sits behind a single call. Application code never talks to a specific provider directly. Swapping or adding a model is a one-line change in a table, not a refactor across the codebase.
Route by the job, not the model
Callers say what they are doing ("cheap bulk", "hard reasoning", "adversarial review"), not which model to use. A routing table maps each job to an ordered list of models. Pin a specific model only as an explicit, still-governed override.
Fall back, and label it
Each job is a chain. If the first model is down, over budget, or errors, the next one runs and the result is flagged "degraded" so downstream consumers know quality dropped. A fallback that silently masks a problem is worse than a failure.
Spend on leverage, not routine
Subscription models, which cost nothing extra per call, lead the high-value work. The metered budget goes where cheap parallelism creates leverage (bulk extraction, fan-out), which preserves scarce quota for the work that needs it. Reserve premium models for high-stakes verification, never for always-on volume.
Measure quality, then tune by hand
Log every call with cost, latency, and — where it matters — a correctness verdict. Read the rollup periodically and move each job toward the model that actually scores best on your tasks. This beats both gut feel and a premature auto-tuner that has no reliable signal to learn from yet.
Diversity beats redundancy in review
When one model checks another, require that they come from different labs. Same-family review shares the same blind spots; a model from a different family is more likely to catch errors the author could not see. One model drafts; a different-family model critiques.
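The last principle is easy to enforce mechanically. A minimal sketch, assuming a hypothetical `FAMILY` registry that records which lab each model came from (all names are placeholders):

```python
# Hypothetical registry: model name -> the lab or family that trained it.
FAMILY = {
    "drafter_a": "lab_one",
    "drafter_b": "lab_one",
    "critic_c": "lab_two",
    "critic_d": "lab_three",
}


def pick_reviewer(author, preferred):
    """Return the first model in `preferred` whose family differs from the author's."""
    author_family = FAMILY[author]
    for candidate in preferred:
        if FAMILY[candidate] != author_family:
            return candidate
    # Failing loudly beats silently falling back to a same-family review.
    raise LookupError(f"no reviewer outside family {author_family!r}")
```

With this table, `pick_reviewer("drafter_a", ["drafter_b", "critic_c"])` skips `drafter_b` because it shares a family with the author and returns `critic_c`.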

The quality loop

The reason most multi-model setups never improve is that they log whether a call ran, not whether it was right. Close that gap: attach a verifier to the outputs that matter, record an accept/reject verdict, and periodically read a rollup of accept-rate per job and model. Then move each job toward the winner. This is a deliberate, measured substitute for an automatic router — and it produces exactly the labeled data an automatic router would eventually need.

job                 model  judged  good  accept%
structured_extract  A          12    11      92%
code_edit           B           8     5      62%   (candidate to demote)
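A rollup like the one above takes very little code to produce. This sketch assumes the call log can be reduced to `(job, model, accepted)` tuples; the function name and the tuple shape are illustrative:

```python
from collections import defaultdict


def rollup(verdicts):
    """Aggregate (job, model, accepted) verdicts into rows, best accept rate first.

    Each row is (job, model, judged, good, accept_rate).
    """
    counts = defaultdict(lambda: [0, 0])  # (job, model) -> [judged, good]
    for job, model, accepted in verdicts:
        entry = counts[(job, model)]
        entry[0] += 1
        entry[1] += int(bool(accepted))
    rows = [
        (job, model, judged, good, good / judged)
        for (job, model), (judged, good) in counts.items()
    ]
    return sorted(rows, key=lambda row: row[4], reverse=True)
```

Only judged calls appear in the input, so the rate is accept-per-judged rather than accept-per-call. With a dozen or so samples per row the rates are noisy, which is one more reason to move jobs by hand instead of automatically.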

Five traps to avoid

✕ Build an ML auto-router on day one
Use a static routing table and tune it from real quality data. An optimizer needs a trustworthy reward signal you will not have until you have collected it.
✕ Adopt a heavy orchestration framework
A few hundred lines, the standard library, and a small test suite scale surprisingly far for a single operator. Add weight only when a real limit forces it.
✕ Trust "it returned" as success
A call that runs is not a call that was correct. Attach verification to the outputs that matter and record the verdict.
✕ Share one global spend cap across everything
Give each caller its own sub-budget so one runaway loop cannot starve the rest of the system for the day.
✕ Pick the "best" model by reputation
Compare candidates on your own prompts — cost, latency, and quality side by side. The right model is the one that wins on your tasks, not the leaderboard.
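The sub-budget trap above can be avoided with a small ledger. A sketch under stated assumptions: the class name, the caller names, and the use of integer cents (to sidestep floating-point drift) are all illustrative, and the daily reset is omitted:

```python
class BudgetLedger:
    """Track daily spend in cents, with a per-caller cap under a global cap."""

    def __init__(self, global_cap, caller_caps):
        self.global_cap = global_cap
        self.caller_caps = dict(caller_caps)
        self.spent = {}  # caller -> cents spent today

    def can_spend(self, caller, estimate):
        """True if `estimate` cents fits under both the caller's cap and the global cap."""
        used = self.spent.get(caller, 0)
        total = sum(self.spent.values())
        return (
            used + estimate <= self.caller_caps[caller]
            and total + estimate <= self.global_cap
        )

    def record(self, caller, cost):
        """Add the actual cost of a finished call to the caller's running total."""
        self.spent[caller] = self.spent.get(caller, 0) + cost
```

A runaway `nightly_job` that has burned through its own cap is refused, while `interactive` callers can still spend from theirs. The Budget rail would call `can_spend` before the model runs and `record` afterward.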

Why it holds up

Models will keep changing; the harness should not have to. By separating what you want done from which model does it, and by wrapping every call in budget, resilience, and observability rails, you get a layer that absorbs the churn. New model drops? Add a row. A provider degrades? The breaker and fallback route around it. Costs creep? The ledger shows exactly where. The system stays small, stays measurable, and keeps improving from its own data — which is the whole point.

A generalized write-up of patterns I use day to day. Architecture only — no implementation specifics.