Your AI isn't too weak. Your evals are missing.

The core of Trackist is an LLM that authors a personalized training week. A user answers about fifteen onboarding questions, the model writes one balanced week of workouts, and the app replicates that week across a training block. The output is free-form JSON, and it is non-deterministic: the same profile gives a slightly different plan every time. For a long while my only quality check was opening a few generated plans and deciding they looked fine.

"Looks fine" is not a measurable property. It does not scale, it is not repeatable, and it quietly misses the failures that matter most: the bug that shows up in one plan out of five is invisible when you glance at two. So I built an eval harness. It found a bug that no larger model could fix, flagged a problem that turned out to be its own mistake, and turned "which model should we ship?" from a vibe into a number with a dollar sign on it.

Here is what it caught.

Key takeaways

An eval turns "is this output good?" into a tracked number. For non-deterministic LLM output, manual spot-checks give false confidence and miss intermittent failures.
The highest-value fix was free. Every model failed one fixture with an invalid targetReps error, and no bigger model or higher effort fixed it. One line of prompt clarification did.
A bigger model is the expensive reflex, not the fix. Most failures live in the prompt and the output contract, not the model's raw capability, and a stronger model there only costs more and fails the same way. The eval is what tells you which.
One alarming finding was my own bug. A frequency check flagged "arms trained only once" in most plans; the plans were fine and the check was naive. Evals are code, and code has bugs.
Quality saturates earlier than you would guess. Sonnet 4.6 at high effort matched Opus on quality at roughly 40% of the cost, with slightly lower latency. The data picked the model.
Unit-style assertions are only level one. Human and model review, then A/B tests, sit above them at higher cost and lower frequency.

The reflex is to reach for a bigger model

When an AI feature underperforms, the first instinct is almost always to upgrade the model. It feels like the obvious lever: a bigger, newer model must do a better job. Sometimes it does. But a larger model is mostly better at predicting the next token; it is not a patch for everything wrapped around it. Most failures I have seen do not live in the model's capability at all. They live in the prompt, the output contract, the structure you ask for, the example you forgot to give. When the problem is there, a stronger model just burns more money per call and fails in exactly the same way.

That is not a hunch. It is the single most useful thing the eval proved, and the first story below is the cleanest example I have: a bug that Opus at maximum effort could not fix, closed by one line of prompt. Without the eval I would have done what most people do, reached for the bigger model, paid more, and still shipped the bug. The eval is what tells you whether your problem is the model or everything around it, so you spend on the lever that actually moves the metric instead of the one that feels powerful.

The cheap harness: split the expensive step from the cheap one

The trick that makes this affordable is separating the one expensive step from the many cheap ones. Generation costs real money because it makes paid API calls. Grading does not have to.

So the harness has two halves. pnpm eval:plans makes the real, paid API calls once and caches every generated plan to disk. pnpm test:functions then grades that cache for free, as many times as I like. I can rewrite a check, fix a bug in a check, or add a new rule, and re-grade thirty cached plans in seconds for zero dollars. I only pay again when I deliberately regenerate, usually to compare a different model on identical inputs.
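
The split can be sketched in a few lines. This is a minimal illustration, not the harness itself: the `Plan` shape, the `PlanCache` class, and `fakeGenerate` are all hypothetical names, and an in-memory `Map` stands in for the on-disk cache so the example stays self-contained.

```typescript
// Hypothetical shapes: the real plan schema and generator are not shown in the post.
type Plan = { fixtureId: string; sample: number; body: string };
type Generate = (fixtureId: string, sample: number) => Plan;

// A Map stands in for the on-disk cache the post describes.
class PlanCache {
  private store = new Map<string, Plan>();

  // Returns the cached plan, calling the paid generator only on a cache miss.
  getOrGenerate(fixtureId: string, sample: number, generate: Generate): Plan {
    const key = `${fixtureId}#${sample}`;
    const hit = this.store.get(key);
    if (hit !== undefined) return hit;
    const plan = generate(fixtureId, sample);
    this.store.set(key, plan);
    return plan;
  }

  all(): Plan[] {
    return Array.from(this.store.values());
  }
}

// Grading only reads the cache, so it can run repeatedly at no API cost.
function grade(cache: PlanCache, check: (plan: Plan) => boolean): number {
  return cache.all().filter(check).length;
}

// Usage: count how often the (fake) paid call actually happens.
let paidCalls = 0;
const fakeGenerate: Generate = (fixtureId, sample) => {
  paidCalls += 1;
  return { fixtureId, sample, body: `plan for ${fixtureId}` };
};

const planCache = new PlanCache();
for (let sample = 0; sample < 3; sample++) {
  planCache.getOrGenerate("bodyweight-2-days", sample, fakeGenerate);
}
// Re-requesting a cached sample does not trigger another paid call.
planCache.getOrGenerate("bodyweight-2-days", 0, fakeGenerate);

// Two different checks graded against the same cache, with paidCalls still at 3.
const nonEmpty = grade(planCache, (plan) => plan.body.length > 0);
const mentionsFixture = grade(planCache, (plan) => plan.body.indexOf("bodyweight") !== -1);
```

The design point is that `grade` never receives the generator, so a rewritten check cannot accidentally spend money.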

The inputs are fixtures that mirror real onboarding answer-sets, not one happy-path profile. Full gym with four days. Bodyweight only with two days. A knee injury with a chest emphasis that avoids overhead pressing. A fat-loss goal where the only equipment available is machines and cables. The failures live in the corners, so the corners are where the fixtures live.
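
As a rough illustration, the four profiles above could be written down as data. Every field name and value here is an assumption made for the example; the post does not show the real fixture format, and fields the post leaves unstated are simply omitted.

```typescript
// All field names and values are hypothetical; only the four scenarios come from the post.
type Equipment = "full_gym" | "bodyweight" | "machines_and_cables";

interface Fixture {
  id: string;
  equipment?: Equipment; // left out where the post does not state it
  daysPerWeek?: number;
  goal?: string;
  injuries?: string[];
  emphasis?: string;
  avoid?: string[];
}

const fixtures: Fixture[] = [
  { id: "full-gym-4-days", equipment: "full_gym", daysPerWeek: 4 },
  { id: "bodyweight-2-days", equipment: "bodyweight", daysPerWeek: 2 },
  {
    id: "knee-injury-chest-emphasis",
    injuries: ["knee"],
    emphasis: "chest",
    avoid: ["overhead_press"],
  },
  { id: "fat-loss-machines-cables", equipment: "machines_and_cables", goal: "fat_loss" },
];

// Fixture ids double as cache keys in a setup like this, so they should be unique.
function hasUniqueIds(items: Fixture[]): boolean {
  const ids = items.map((f) => f.id);
  return new Set(ids).size === ids.length;
}

const idsAreUnique = hasUniqueIds(fixtures);
```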

Grading happens in two layers. First, validation: a plan only counts if it passes the same schema the real app enforces, the structural and cross-field rules that make a week coherent. Invalid plans never reach the second layer. Then quality checks grade what validation cannot, the coaching judgment the prompt commits to, things like sensible exercise ordering and balanced volume across the week. I will keep those deliberately vague here, but each one is a small, explicit rule.
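
A minimal sketch of the two layers, assuming a simplified plan shape. The `Plan` type, the placeholder validator, and the single quality check are invented for illustration; the real rules are the ones the post keeps vague.

```typescript
// Hypothetical plan shape; the app's real schema is richer than this.
interface Exercise { name: string; targetSets: number }
interface Plan { days: Exercise[][] }

type Validator = (plan: Plan) => string[]; // returns a list of validation errors
interface QualityCheck { name: string; passes: (plan: Plan) => boolean }

type Grade =
  | { valid: false; errors: string[] }
  | { valid: true; qualityMisses: string[] };

function gradePlan(plan: Plan, validate: Validator, checks: QualityCheck[]): Grade {
  const errors = validate(plan);
  if (errors.length > 0) {
    return { valid: false, errors }; // invalid plans never reach layer two
  }
  const qualityMisses = checks.filter((c) => !c.passes(plan)).map((c) => c.name);
  return { valid: true, qualityMisses };
}

// Placeholder rules for illustration only.
const validatePlan: Validator = (plan) => {
  const errors: string[] = [];
  plan.days.forEach((day, i) => {
    if (day.length === 0) errors.push(`day ${i + 1} has no exercises`);
  });
  return errors;
};

const qualityChecks: QualityCheck[] = [
  { name: "max-six-exercises-per-day", passes: (plan) => plan.days.every((d) => d.length <= 6) },
];

const emptyDay = gradePlan({ days: [[]] }, validatePlan, qualityChecks);
const goodPlan = gradePlan({ days: [[{ name: "Squat", targetSets: 3 }]] }, validatePlan, qualityChecks);
```

Modelling the result as a union keeps the two outcomes from mixing: an invalid plan carries errors and has no quality score at all.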

Writing the checks was its own payoff. Encoding "what makes a plan good" as code forced me to make coaching rules explicit that had only ever lived in my head, and it surfaced disagreements I had never resolved, like whether training arms twice a week should count the indirect work every press and pull already does. You cannot automate a definition of "good" you have not written down.

Evals come in three levels, and this is level two

It helps to be precise about what kind of eval this is, because "eval" gets used for very different things at very different costs.

Level	What it is	Runs	Cost
1 — Unit assertions	Fast, deterministic checks on output structure and rules	Every commit	Very low
2 — Human and model review	Systematic review and automated critique of quality	Weekly or biweekly	Medium
3 — A/B testing	Real-user experiments to measure business impact	Major releases	High

Level one is the cheap deterministic floor. It runs on every code change and answers "is this output even shaped right, and does it obey the hard rules?" That is the validation layer in my harness, and it costs almost nothing because it runs against the cache.

Level two is where I spent most of my effort, and where this post lives. It is the systematic review of quality, not just correctness. It is part automated critique, the coaching-quality checks, and part human judgment, me reading plans across models to decide which ones actually coach well. This is the level that judged plan quality across Haiku, Sonnet, and Opus, and it is the level that picked the model. It runs on a cadence, weekly or biweekly or before a model change, not on every commit, because regeneration costs money.

Level three is A/B testing: shipping two variants to real users and measuring whether the "better" plan actually changes behavior, like adherence or retention. That is the most expensive and the slowest, reserved for major releases, and it is the only level that measures business impact rather than a proxy for it. I have not run that for plan generation yet. It is the honest next step.

The mistake is treating level one as the whole story. Unit-style assertions tell you the JSON parses and the rules hold. They cannot tell you the coaching is smart. That gap is exactly what caught the two stories below.

Story one: the bug a bigger model could not fix

Every model failed the bodyweight fixture with the same error: invalid targetReps. The harness localized it precisely, to one field on one fixture, which is most of the work of fixing a bug.

The root cause was not the model. targetReps is a reps-only value by contract, enforced by the app's schema, matching a pattern like 8 or 8-12. But for a timed hold such as a plank, the model wrote a time target like "30-60s". That is reasonable coaching. The schema simply cannot express it, and the prompt never told the model how to write a timed hold within a reps-only field.
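
To make the contract concrete, here is a sketch of a reps-only pattern. The regular expression is an assumption inferred from the two examples above ("8" and "8-12"); the app's real schema may be stricter or looser.

```typescript
// Assumed pattern, inferred from the examples "8" and "8-12"; not the app's actual schema.
const TARGET_REPS = /^\d+(-\d+)?$/;

function isValidTargetReps(value: string): boolean {
  return TARGET_REPS.test(value);
}

const repsSamples = ["8", "8-12", "30-60s", "30-60"];
const repsResults = repsSamples.map(isValidTargetReps);
// repsResults is [true, true, false, true]
```

Under this pattern the trailing "s" is the only thing that makes "30-60s" invalid, while the bare range "30-60" passes.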

My instinct, and probably yours, was that a more capable model would get this right. It did not. Opus at maximum effort failed the exact same way. This was never a capability ceiling. It was a prompt gap, and no amount of model size or reasoning effort closes a gap in the instructions.

The fix was one line, telling the model that a 30 to 60 second plank is written "30-60" and the app supplies the unit. That took Sonnet at high effort from nine out of ten valid plans to a perfect ten, and held across six bodyweight samples on a later run.

The cheapest lever fixed what the most expensive lever could not. Reaching for a bigger model would have cost more per call and still failed. The eval is what revealed which lever actually moved the metric, instead of leaving me to guess that "use Opus" was the answer.

Story two: the finding that was actually my bug

A red eval is a hypothesis, not a verdict. A later run flagged something alarming in five of six plans, "arms trained only once", and my first instinct was that the model had gotten worse. The model was fine. My check was wrong.

The check was too strict. It tested for a rule the output was supposed to follow, but it did not account for a second, legitimate way the output already satisfied that rule, so plans that were actually correct were marked as failures. The fix was to the check, not to the prompt or the model. Whatever you build, this is the check that fires red because it does not know a rule your system legitimately follows, and it manufactures false failures that look exactly like real ones.
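
The post keeps the real check vague, so the following is a purely hypothetical illustration of the pattern, built on the arms example from the takeaways. It assumes, without confirmation from the source, that the second legitimate way to satisfy the rule is indirect work, and that each exercise lists the muscles it trains directly and indirectly.

```typescript
// Hypothetical data model: primary = trained directly, secondary = trained indirectly.
interface Exercise { name: string; primary: string[]; secondary: string[] }
type Week = Exercise[][]; // one array of exercises per training day

function lists(muscles: string[], muscle: string): boolean {
  return muscles.indexOf(muscle) !== -1;
}

// Naive check: a day counts only if some exercise targets the muscle directly.
function daysTrainedDirect(week: Week, muscle: string): number {
  return week.filter((day) => day.some((ex) => lists(ex.primary, muscle))).length;
}

// Revised check: indirect work from other lifts also counts toward frequency.
function daysTrainedAny(week: Week, muscle: string): number {
  return week.filter((day) =>
    day.some((ex) => lists(ex.primary, muscle) || lists(ex.secondary, muscle))
  ).length;
}

const sampleWeek: Week = [
  [{ name: "Bench press", primary: ["chest"], secondary: ["arms"] }],
  [{ name: "Barbell curl", primary: ["arms"], secondary: [] }],
];

const directDays = daysTrainedDirect(sampleWeek, "arms"); // 1, so a "twice a week" rule fires red
const anyDays = daysTrainedAny(sampleWeek, "arms"); // 2, so the same plan passes
```

The plan is identical in both calls; only the definition encoded in the check changed.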

This is the uncomfortable part that does not appear in eval marketing. A failing eval is not automatically a model problem. Evals are code, your checks encode your assumptions, and those assumptions need review as much as the thing they evaluate. Before you act on a red result, confirm the check is right, otherwise you will "fix" a model that was never broken and degrade something that worked.

The model bake-off: quality saturates early

With the harness trustworthy, the model question became answerable with data instead of intuition. I ran the same fixtures through Haiku, Sonnet, and Opus and let the numbers talk.

Cost per generated plan

Model	Cost / plan
Haiku 4.5	$0.014
Sonnet 4.6	$0.029
Opus · high	$0.073
Opus · med	$0.080

One balanced training week per call. Haiku and Opus figures predate the prompt fix above, so treat the small gaps as directional, not exact.

The full picture, with quality and latency alongside cost:

Model	Effort	Valid	Cost / plan	Latency	Real quality misses
Haiku 4.5	none¹	7/10	$0.014	12.4s	2 — wrong exercise order, broke a conditioning rule
Sonnet 4.6	high	10/10	$0.029	17.2s	low-rate, intermittent only
Opus 4.8	medium	8/10²	$0.080	23.3s	0
Opus 4.8	high	8/10²	$0.073	19.6s	0

¹ Haiku does not support an effort setting, so it runs without one.
² Opus's only invalids were the bodyweight targetReps slip the prompt fix later resolved.

Three things fell out of this.

Quality saturates earlier than intuition suggests. Sonnet at high effort showed only low-rate, intermittent quality misses. Opus, at roughly two and a half times the cost and no faster, did not out-quality it. On these fixtures Opus actually scored lower on validity and bought nothing measurable. More model was not more quality.

Effort hits the same wall. Opus at medium effort matched Opus at high on validity and quality misses, and the cost and latency gaps between the two were small. Once a model has saturated the task, asking it to think harder does not improve the output.

Haiku is the budget corner, and it shows. It is the cheapest and fastest by a wide margin, but it carried the lowest validity and two genuine quality misses: it ordered the exercises in a way the prompt explicitly forbids, and it added a block the prompt says to leave off on certain plan types. Both are rules the prompt commits to and the cheaper model broke. That is a legible budget-versus-quality trade, not a guess.

The pick was Sonnet 4.6 at high effort: the best validity in the set, quality level with Opus, slightly lower latency, at well under half the cost. The eval did not just suggest that. It proved it on identical inputs.
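
The comparison behind that pick is simple arithmetic on the table rows. The sketch below uses the rounded figures from the table; the `BakeoffRow` type and the `compare` helper are hypothetical names introduced for the example.

```typescript
interface BakeoffRow {
  model: string;
  effort: string;
  valid: number;
  total: number;
  costPerPlan: number; // dollars, as rounded in the table
  latencySec: number;
}

const sonnetHigh: BakeoffRow = { model: "Sonnet 4.6", effort: "high", valid: 10, total: 10, costPerPlan: 0.029, latencySec: 17.2 };
const opusHigh: BakeoffRow = { model: "Opus 4.8", effort: "high", valid: 8, total: 10, costPerPlan: 0.073, latencySec: 19.6 };

function roundTo(value: number, decimals: number): number {
  const factor = Math.pow(10, decimals);
  return Math.round(value * factor) / factor;
}

// Expresses one row relative to a baseline row.
function compare(row: BakeoffRow, baseline: BakeoffRow) {
  return {
    costRatio: roundTo(row.costPerPlan / baseline.costPerPlan, 2),
    latencyDeltaSec: roundTo(row.latencySec - baseline.latencySec, 1),
    validRate: row.valid / row.total,
  };
}

const sonnetVsOpus = compare(sonnetHigh, opusHigh);
// { costRatio: 0.4, latencyDeltaSec: -2.4, validRate: 1 }
const opusVsSonnet = compare(opusHigh, sonnetHigh);
// { costRatio: 2.52, latencyDeltaSec: 2.4, validRate: 0.8 }
```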

What the choice is worth at scale

Per-plan numbers of a few cents are easy to wave away. They stop being abstract when you multiply by a year of traffic. The chart below projects annual cost at an illustrative ten thousand plans a month. To be clear, that is a hypothetical volume to show how the per-plan gap scales, not a figure from Trackist's actual usage.

Annual cost at 10,000 plans / month (illustrative)

Model	Annual cost
Haiku 4.5	$1.7k
Sonnet 4.6	$3.5k
Opus · high	$8.7k
Opus · med	$9.6k

Cost per plan × 10,000 × 12. Volumes shown are illustrative scenarios, not Trackist's real traffic.

The same gap across a few volumes:

Plans / month	Haiku	Sonnet 4.6	Opus · high	Opus · med
1,000	$169	$353	$872	$964
10,000	$1,692	$3,528	$8,724	$9,636
50,000	$8,460	$17,640	$43,620	$48,180

At ten thousand plans a month, choosing Sonnet over Opus at high effort saves about $5,200 a year, and over $6,000 against Opus at medium, for output the eval rates as equal or better. At fifty thousand a month that gap is roughly $26,000 to $30,000 a year. That is the difference between a defensible decision and an expensive habit, and before the eval I had no way to see it. I would have reached for the biggest model, paid the most, and shipped slightly lower validity.
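
The projection is cost per plan × plans per month × 12. One caveat for anyone reproducing the table: it only reproduces exactly with per-plan costs carried to one more digit than the bake-off table shows. The four-decimal values below are back-calculated from the projection table (for example, $3,528 over 120,000 plans), so treat them as inferred rather than reported.

```typescript
// Per-plan costs in dollars, back-calculated from the projection table (an inference, not reported figures).
const perPlan = { haiku: 0.0141, sonnet: 0.0294, opusHigh: 0.0727, opusMed: 0.0803 };

// Annual cost in whole dollars.
function annualCost(costPerPlan: number, plansPerMonth: number): number {
  return Math.round(costPerPlan * plansPerMonth * 12);
}

const sonnetAt10k = annualCost(perPlan.sonnet, 10000); // 3528
const opusHighAt10k = annualCost(perPlan.opusHigh, 10000); // 8724
const opusMedAt10k = annualCost(perPlan.opusMed, 10000); // 9636

const savedVsOpusHigh = opusHighAt10k - sonnetAt10k; // 5196
const savedVsOpusMed = opusMedAt10k - sonnetAt10k; // 6108

// The same gap at fifty thousand plans a month.
const savedVsOpusHighAt50k = annualCost(perPlan.opusHigh, 50000) - annualCost(perPlan.sonnet, 50000); // 25980
const savedVsOpusMedAt50k = annualCost(perPlan.opusMed, 50000) - annualCost(perPlan.sonnet, 50000); // 30540
```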

Sample size cut both ways

One meta-lesson is worth its own section, because it surprised me. I started at two samples per fixture and later moved to six. Both the floor and the ceiling mattered.

At two samples, two real but intermittent quality misses were invisible: one plan type broke an ordering rule about a third of the time, and another occasionally violated a balance rule. Too few samples hid genuine, low-rate failures. At six samples, both surfaced as rates, roughly two in six each, which is the unit that actually tells you whether to act.

But more samples are not free, and small samples also hid my broken check. The false positive from story two only became obvious at six samples, because that is when it fired loudly enough to investigate. Rates, not single examples, are the unit of an eval. Two examples are an anecdote that can mislead in either direction. The cost is that every sample is a paid generation, so you buy confidence by the unit.
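
A back-of-the-envelope calculation shows why two samples hide a failure that six reveal. It assumes each generation fails independently at a fixed rate, which is a simplification, and takes the "about a third of the time" figure above as the rate.

```typescript
// Probability that a failure occurring at rate p never shows up in n independent samples.
function probMissed(failureRate: number, samples: number): number {
  return Math.pow(1 - failureRate, samples);
}

const missedAtTwo = probMissed(1 / 3, 2); // (2/3)^2 = 4/9, about 0.44
const missedAtSix = probMissed(1 / 3, 6); // (2/3)^6 = 64/729, about 0.09
```

Under those assumptions, two samples miss a one-in-three failure almost half the time, while six samples miss it less than one time in ten.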

The honest limits

Evals are not magic, and pretending otherwise sets you up to trust the wrong number.

Checks encode your assumptions, and a naive or over-strict check manufactures false findings, as the arms case proved. Deterministic checks also judge proxies, not nuance: they verify structure, balance, and ordering, but they cannot tell you whether the coaching is genuinely smart or the exercise selection is inspired. Coverage is bounded by imagination, since an eval only catches the failure modes you thought to encode. And the fixtures are a hand-maintained copy of production vocabulary that can drift, which is why a drift guard now fails the run if a fixture references an exercise that no longer exists.
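
A drift guard of that kind can be very small. This sketch assumes each fixture exposes the exercise ids it references and that the current catalog is available as a list of ids; both shapes, and the names `findDrift` and `assertNoDrift`, are hypothetical.

```typescript
// Hypothetical shapes: the real fixture and catalog formats are not shown in the post.
interface Fixture { id: string; exerciseIds: string[] }
interface Drift { fixtureId: string; missing: string[] }

// Lists every fixture that references an exercise absent from the catalog.
function findDrift(fixtures: Fixture[], catalog: string[]): Drift[] {
  const known = new Set(catalog);
  return fixtures
    .map((f) => ({ fixtureId: f.id, missing: f.exerciseIds.filter((id) => !known.has(id)) }))
    .filter((d) => d.missing.length > 0);
}

// Throws so the eval run fails loudly instead of grading stale fixtures.
function assertNoDrift(fixtures: Fixture[], catalog: string[]): void {
  const drift = findDrift(fixtures, catalog);
  if (drift.length > 0) {
    const detail = drift.map((d) => `${d.fixtureId}: ${d.missing.join(", ")}`).join("; ");
    throw new Error(`Fixture drift detected: ${detail}`);
  }
}

const exerciseCatalog = ["push_up", "plank", "goblet_squat"];
const evalFixtures: Fixture[] = [
  { id: "bodyweight-2-days", exerciseIds: ["push_up", "plank"] },
  { id: "knee-injury", exerciseIds: ["goblet_squat", "overhead_press"] },
];

const driftReport = findDrift(evalFixtures, exerciseCatalog);
// driftReport is [{ fixtureId: "knee-injury", missing: ["overhead_press"] }]
```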

None of that is an argument against evals. It is an argument for treating the eval as a system you maintain, not a verdict you trust blindly. The honest next step here is level three: an A/B test that measures whether a higher-rated plan actually changes adherence, plus a model-as-judge pass for the nuance the deterministic checks miss.

An eval is a habit, not a one-time check

The most important shift is treating the eval as something you run constantly, not once. An AI feature is never finished. There is always another failure mode you have not encoded, another corner case a real user will find, another prompt tweak that quietly improves one thing and breaks another. That last part is the dangerous one. LLM output is non-deterministic and tightly coupled to the prompt, so a change that looks harmless, a reworded instruction, a new field, a model bump, can silently undo a fix you made weeks ago.

That is why the cheapest checks run on every commit. Once the prompt fix closed the targetReps bug, the validation check became a guard: any future change that reintroduces the bug fails immediately, instead of shipping and surfacing as a broken plan for a real user. The expensive quality and A/B layers run less often, but the deterministic floor is fast and free enough to run every single time, and it should. You improve an AI feature the way you improve any system, in small consistent steps, each one protected by a check that confirms the last improvement still holds. The goal is not to reach a perfect score once. It is to never quietly go backwards.

Why this matters if you build on LLMs

The general lessons travel well beyond workout plans:

You cannot improve what you cannot measure, and LLM output is non-deterministic, so manual spot-checks give false confidence.
Evals localize bugs. Pinning a failure to one fixture and one field pointed straight at the fix.
Prompt beats model size more often than you expect. The highest-return fix was free, and the bigger model would have cost more and still failed.
Model selection becomes data-driven: cost, latency, and quality on identical inputs, not a hunch that bigger is better.
Cheap deterministic checks on cached output let you iterate dozens of times for cents instead of re-running expensive generations.
Realistic, varied fixtures are essential, because a single happy-path profile hides the failures the corner cases expose immediately.

Before the harness, I was eyeballing a handful of plans and trusting my gut on the model. After it, I had a bug fixed for free that no model could fix, a check I corrected before it misled me, and a model choice backed by dollars and milliseconds. The score is not the point. The point is that every decision now has a reason behind it that I can show you, which is the entire difference between testing an AI feature and hoping it works.