The core of Trackist is an LLM that authors a personalized training week. A user answers about fifteen onboarding questions, the model writes one balanced week of workouts, and the app replicates that week across a training block. The output is free-form JSON, and it is non-deterministic: the same profile gives a slightly different plan every time. For a long while my only quality check was opening a few generated plans and deciding they looked fine.
"Looks fine" is not a measurable property. It does not scale, it is not repeatable, and it quietly misses the failures that matter most: the bug that shows up in one plan out of five is invisible when you glance at two. So I built an eval harness. It found a bug that no larger model could fix, flagged a problem that turned out to be its own mistake, and turned "which model should we ship?" from a vibe into a number with a dollar sign on it.
Here is what it caught.
The reflex is to reach for a bigger model
When an AI feature underperforms, the first instinct is almost always to upgrade the model. It feels like the obvious lever: a bigger, newer model must do a better job. Sometimes it does. But a larger model is mostly better at predicting the next token. It is not a patch for everything wrapped around it, and most failures I have seen do not live in the model's capability at all. They live in the prompt, the output contract, the structure you ask for, the example you forgot to give. When the problem is there, a stronger model just burns more money per call and fails in exactly the same way.
That is not a hunch. It is the single most useful thing the eval proved, and the first story below is the cleanest example I have: a bug that Opus at maximum effort could not fix, closed by one line of prompt. Without the eval I would have done what most people do, reached for the bigger model, paid more, and still shipped the bug. The eval is what tells you whether your problem is the model or everything around it, so you spend on the lever that actually moves the metric instead of the one that feels powerful.
The cheap harness: split the expensive step from the cheap one
The trick that makes this affordable is separating the one expensive step from the many cheap ones. Generation costs real money because it makes paid API calls. Grading does not have to.
So the harness has two halves. `pnpm eval:plans` makes the real, paid API calls once and caches every generated plan to disk. `pnpm test:functions` then grades that cache for free, as many times as I like. I can rewrite a check, fix a bug in a check, or add a new rule, and re-grade thirty cached plans in seconds for zero dollars. I only pay again when I deliberately regenerate, usually to compare a different model on identical inputs.
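As a minimal sketch of that split, assuming a cache keyed by fixture and sample number: the type names, `generateIntoCache`, and `gradeCache` are hypothetical, and a plain object stands in for the on-disk cache. Skipping entries that are already cached is also an assumption of the sketch, not a description of the real scripts.

```typescript
// Hypothetical shapes: the real plan format is not shown in this post.
type CachedPlan = { fixtureId: string; sample: number; json: string };
// Stand-in for the on-disk cache: key -> generated plan.
type PlanCache = Record<string, CachedPlan>;
// The paid step: one model call per (fixture, sample).
type Generate = (fixtureId: string, sample: number) => Promise<string>;

function cacheKey(fixtureId: string, sample: number): string {
  return `${fixtureId}#${sample}`;
}

// Paid half: call the model only for plans that are not cached yet.
// Returns how many paid calls were made.
async function generateIntoCache(
  cache: PlanCache,
  fixtureIds: string[],
  samplesPerFixture: number,
  generate: Generate,
): Promise<number> {
  let paidCalls = 0;
  for (const fixtureId of fixtureIds) {
    for (let sample = 0; sample < samplesPerFixture; sample++) {
      const key = cacheKey(fixtureId, sample);
      if (key in cache) continue; // already paid for
      const json = await generate(fixtureId, sample);
      cache[key] = { fixtureId, sample, json };
      paidCalls++;
    }
  }
  return paidCalls;
}

// Free half: grade whatever is cached, with no API calls at all.
function gradeCache(
  cache: PlanCache,
  check: (plan: CachedPlan) => boolean,
): { passed: number; total: number } {
  const plans = Object.keys(cache).map((key) => cache[key]);
  const passed = plans.filter(check).length;
  return { passed, total: plans.length };
}
```

The point of the shape is that `gradeCache` never touches `generate`, so rewriting a check cannot trigger a paid call.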
The inputs are fixtures that mirror real onboarding answer-sets, not one happy-path profile. Full gym with four days. Bodyweight only with two days. A knee injury with a chest emphasis that avoids overhead pressing. A fat-loss goal where the only equipment available is machines and cables. The failures live in the corners, so the corners are where the fixtures live.
Grading happens in two layers. First, validation: a plan only counts if it passes the same schema the real app enforces, the structural and cross-field rules that make a week coherent. Invalid plans never reach the second layer. Then quality checks grade what validation cannot, the coaching judgment the prompt commits to, things like sensible exercise ordering and balanced volume across the week. I will keep those deliberately vague here, but each one is a small, explicit rule.
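The two layers can be sketched roughly as follows. Everything here is illustrative: the `Plan` shape, the stand-in `validate` function, and the example quality check are assumptions, not the app's real schema or rules.

```typescript
// Hypothetical plan shape, for illustration only.
type Plan = { days: { exercises: string[] }[] };

type CheckResult = { check: string; passed: boolean };
type Grade =
  | { valid: false; errors: string[] }
  | { valid: true; quality: CheckResult[] };

// Layer 1: stand-in for the schema the real app enforces.
function validate(plan: Plan): string[] {
  const errors: string[] = [];
  if (plan.days.length === 0) errors.push("plan has no training days");
  plan.days.forEach((day, i) => {
    if (day.exercises.length === 0) errors.push(`day ${i + 1} has no exercises`);
  });
  return errors;
}

// Layer 2: each quality check is a small, named, explicit rule.
type QualityCheck = { name: string; run: (plan: Plan) => boolean };

function grade(plan: Plan, checks: QualityCheck[]): Grade {
  const errors = validate(plan);
  // Invalid plans stop here and never reach the quality layer.
  if (errors.length > 0) return { valid: false, errors };
  return {
    valid: true,
    quality: checks.map((c) => ({ check: c.name, passed: c.run(plan) })),
  };
}

// A made-up quality rule: no day carries more than six exercises.
const dayVolumeWithinLimit: QualityCheck = {
  name: "day-volume-within-limit",
  run: (plan) => plan.days.every((day) => day.exercises.length <= 6),
};
```

Keeping the gate in `grade` means a quality score is only ever reported for a plan the app would actually accept.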
Writing the checks was its own payoff. Encoding "what makes a plan good" as code forced me to make coaching rules explicit that had only ever lived in my head, and it surfaced disagreements I had never resolved, like whether training arms twice a week should count the indirect work every press and pull already does. You cannot automate a definition of "good" you have not written down.
Evals come in three levels, and this is level two
It helps to be precise about what kind of eval this is, because "eval" gets used for very different things at very different costs.
| Level | What it is | Runs | Cost |
|---|---|---|---|
| 1 — Unit assertions | Fast, deterministic checks on output structure and rules | Every commit | Very low |
| 2 — Human and model review | Systematic review and automated critique of quality | Weekly or biweekly | Medium |
| 3 — A/B testing | Real-user experiments to measure business impact | Major releases | High |
Level one is the cheap deterministic floor. It runs on every code change and answers "is this output even shaped right, and does it obey the hard rules?" That is the validation layer in my harness, and it costs almost nothing because it runs against the cache.
Level two is where I spent most of my effort, and where this post lives. It is the systematic review of quality, not just correctness. It is part automated critique, the coaching-quality checks, and part human judgment, me reading plans across models to decide which ones actually coach well. This is the level that judged plan quality across Haiku, Sonnet, and Opus, and it is the level that picked the model. It runs on a cadence, weekly or biweekly or before a model change, not on every commit, because regeneration costs money.
Level three is A/B testing: shipping two variants to real users and measuring whether the "better" plan actually changes behavior, like adherence or retention. It is the most expensive and slowest level, reserved for major releases, and it is the only one that measures business impact rather than a proxy for it. I have not run that for plan generation yet. It is the honest next step.
The mistake is treating level one as the whole story. Unit-style assertions tell you the JSON parses and the rules hold. They cannot tell you the coaching is smart. That gap is exactly what caught the two stories below.
Story one: the bug a bigger model could not fix
Every model failed the bodyweight fixture with the same error: invalid targetReps. The harness localized it precisely, to one field on one fixture, which is most of the work of fixing a bug.
The root cause was not the model. targetReps is a reps-only value by contract, enforced by the app's schema, matching a pattern like 8 or 8-12. But for a timed hold such as a plank, the model wrote a time target like "30-60s". That is reasonable coaching. The schema simply cannot express it, and the prompt never told the model how to write a timed hold within a reps-only field.
My instinct, and probably yours, was that a more capable model would get this right. It did not. Opus at maximum effort failed the exact same way. This was never a capability ceiling. It was a prompt gap, and no amount of model size or reasoning effort closes a gap in the instructions.
The fix was one line, telling the model that a 30 to 60 second plank is written "30-60" and the app supplies the unit. That took Sonnet at high effort from nine out of ten valid plans to a perfect ten, and held across six bodyweight samples on a later run.
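To make the contract concrete, here is a minimal sketch of a reps-only check. The exact pattern is an assumption based on the "8 or 8-12" description above (a single count, or a range of two counts), and `isValidTargetReps` is a hypothetical name, not the app's real validator.

```typescript
// Assumed pattern: a single count ("8") or a range ("8-12"), digits only.
const TARGET_REPS_PATTERN = /^\d+(-\d+)?$/;

function isValidTargetReps(value: string): boolean {
  return TARGET_REPS_PATTERN.test(value);
}

console.log(isValidTargetReps("8"));      // true
console.log(isValidTargetReps("8-12"));   // true
console.log(isValidTargetReps("30-60s")); // false: the unit suffix breaks the contract
console.log(isValidTargetReps("30-60"));  // true: the form the prompt fix asks for
```

Under this assumed pattern, the model's "30-60s" is one character away from valid, which is why the fix belonged in the prompt rather than in the model.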
The cheapest lever fixed what the most expensive lever could not. Reaching for a bigger model would have cost more per call and still failed. The eval is what revealed which lever actually moved the metric, instead of leaving me to guess that "use Opus" was the answer.
Story two: the finding that was actually my bug
A red eval is a hypothesis, not a verdict. A later run flagged something alarming in five of six plans, and my first instinct was that the model had gotten worse. The model was fine. My check was wrong.
The check was too strict. It tested for a rule the output was supposed to follow, but it did not account for a second, legitimate way the output already satisfied that rule, so plans that were actually correct were marked as failures. The fix was to the check, not to the prompt or the model. Whatever you build, you will eventually write a version of this check: one that fires red because it does not know about a rule your system legitimately follows, and that manufactures false failures that look exactly like real ones.
This is the uncomfortable part that does not appear in eval marketing. A failing eval is not automatically a model problem. Evals are code, your checks encode your assumptions, and those assumptions need review as much as the thing they evaluate. Before you act on a red result, confirm the check is right. Otherwise you will "fix" a model that was never broken and degrade something that worked.
The model bake-off: quality saturates early
With the harness trustworthy, the model question became answerable with data instead of intuition. I ran the same fixtures through Haiku, Sonnet, and Opus and let the numbers talk.
The full picture, with quality and latency alongside cost:
| Model | Effort | Valid | Cost / plan | Latency | Real quality misses |
|---|---|---|---|---|---|
| Haiku 4.5 | none¹ | 7/10 | $0.014 | 12.4s | 2 — wrong exercise order, broke a conditioning rule |
| Sonnet 4.6 | high | 10/10 | $0.029 | 17.2s | low-rate, intermittent only |
| Opus 4.8 | medium | 8/10² | $0.080 | 23.3s | 0 |
| Opus 4.8 | high | 8/10² | $0.073 | 19.6s | 0 |
¹ Haiku does not support an effort setting, so it runs without one.

² Opus's only invalids were the bodyweight targetReps slip the prompt fix later resolved.
Three things fell out of this.
Quality saturates earlier than intuition suggests. Sonnet at high effort showed no consistent quality misses, only low-rate intermittent ones. Opus, at roughly two and a half times the cost and no faster, did not out-quality it. On these fixtures Opus actually scored lower on validity and bought nothing measurable. More model was not more quality.
Effort hits the same wall. Opus at medium effort matched Opus at high on validity and quality misses, and it was no cheaper or faster. Once a model has saturated the task, asking it to think harder changes nothing but the bill.
Haiku is the budget corner, and it shows. It is the cheapest and fastest by a wide margin, but it carried the lowest validity and two genuine quality misses: it ordered the exercises in a way the prompt explicitly forbids, and it added a block the prompt says to leave off on certain plan types. Both are rules the prompt commits to and the cheaper model broke. That is a legible budget-versus-quality trade, not a guess.
The pick was Sonnet 4.6 at high effort: the best validity in the set, quality level with Opus, lower latency, at well under half the cost. The eval did not just suggest that. It proved it on identical inputs.
What the choice is worth at scale
Per-plan numbers of a few cents are easy to wave away. They stop being abstract when you multiply by a year of traffic. The chart below projects annual cost at an illustrative ten thousand plans a month. To be clear, that is a hypothetical volume to show how the per-plan gap scales, not a figure from Trackist's actual usage.
The same gap across a few volumes, as projected annual cost in dollars:
| Plans / month | Haiku | Sonnet 4.6 | Opus · high | Opus · med |
|---|---|---|---|---|
| 1,000 | $169 | $353 | $872 | $964 |
| 10,000 | $1,692 | $3,528 | $8,724 | $9,636 |
| 50,000 | $8,460 | $17,640 | $43,620 | $48,180 |
At ten thousand plans a month, choosing Sonnet over Opus at high effort saves about $5,200 a year, and over $6,000 against Opus at medium, for output the eval rates as equal or better. At fifty thousand a month that gap is roughly $26,000 to $30,000 a year. That is the difference between a defensible decision and an expensive habit, and before the eval I had no way to see it. I would have reached for the biggest model, paid the most, and shipped slightly lower validity.
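The arithmetic behind those projections is simply cost per plan times plans per month times twelve. The sketch below uses the per-plan costs as printed in the bake-off table, which are rounded to three decimals, so its totals land within about two percent of the volume table; my assumption is that the table was computed from unrounded costs. The function names are hypothetical.

```typescript
// Per-plan costs in dollars, as printed (rounded) in the bake-off table.
const costPerPlan = {
  haiku: 0.014,
  sonnet: 0.029,
  opusHigh: 0.073,
  opusMedium: 0.08,
};

// Annual cost in whole dollars = cost per plan x plans per month x 12 months.
function annualCost(perPlan: number, plansPerMonth: number): number {
  return Math.round(perPlan * plansPerMonth * 12);
}

// How much the cheaper model saves per year at a given volume.
function annualSavings(cheaper: number, pricier: number, plansPerMonth: number): number {
  return annualCost(pricier, plansPerMonth) - annualCost(cheaper, plansPerMonth);
}

console.log(annualCost(costPerPlan.sonnet, 10000));   // 3480
console.log(annualCost(costPerPlan.opusHigh, 10000)); // 8760
console.log(annualSavings(costPerPlan.sonnet, costPerPlan.opusHigh, 10000)); // 5280
```

With rounded inputs the Sonnet versus Opus-high gap comes out at $5,280 rather than the roughly $5,200 quoted above, which is the size of error you get from three-decimal rounding and does not change the conclusion.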
Sample size cut both ways
One meta-lesson is worth its own section, because it surprised me. I started at two samples per fixture and later moved to six. Both the floor and the ceiling mattered.
At two samples, two real but intermittent quality misses were invisible: one plan type broke an ordering rule about a third of the time, and another occasionally violated a balance rule. Too few samples hid genuine, low-rate failures. At six samples, both surfaced as rates, roughly two in six each, which is the unit that actually tells you whether to act.
But more samples are not free, and small samples also hid my broken check. The false positive from story two only became obvious at six samples, because that is when it fired loudly enough to investigate. Rates, not single examples, are the unit of an eval. Two examples are an anecdote that can mislead in either direction. The cost is that every sample is a paid generation, so you buy confidence by the unit.
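A short worked example shows why two samples hide a one-in-three failure so easily. It assumes each generation fails independently at the same rate, which is a simplification, and both function names are hypothetical.

```typescript
// Chance that every one of `samples` independent runs passes
// when the true per-run failure rate is `failureRate`.
function chanceOfSeeingNoFailure(failureRate: number, samples: number): number {
  return Math.pow(1 - failureRate, samples);
}

// Observed failure rate for one check on one fixture,
// where each entry records whether that sample passed.
function observedFailureRate(passed: boolean[]): number {
  if (passed.length === 0) return 0;
  const failures = passed.filter((ok) => !ok).length;
  return failures / passed.length;
}

console.log(chanceOfSeeingNoFailure(1 / 3, 2).toFixed(3)); // "0.444"
console.log(chanceOfSeeingNoFailure(1 / 3, 6).toFixed(3)); // "0.088"
```

Under that assumption, a rule broken a third of the time slips through a two-sample run about 44% of the time, but through a six-sample run only about 9% of the time.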
The honest limits
Evals are not magic, and pretending otherwise sets you up to trust the wrong number.
Checks encode your assumptions, and a naive or over-strict check manufactures false findings, as story two proved. Deterministic checks also judge proxies, not nuance: they verify structure, balance, and ordering, but they cannot tell you whether the coaching is genuinely smart or the exercise selection is inspired. Coverage is bounded by imagination, since an eval only catches the failure modes you thought to encode. And the fixtures are a hand-maintained copy of production vocabulary that can drift, which is why a drift guard now fails the run if a fixture references an exercise that no longer exists.
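A drift guard of that kind can be very small. The sketch below is one possible shape, not the actual implementation: the `Fixture` type, the idea that each fixture lists exercise names, and both function names are assumptions.

```typescript
// Hypothetical shape: a fixture lists the exercise names it refers to.
type Fixture = { id: string; exerciseNames: string[] };
type Drift = { fixtureId: string; missing: string[] };

// Report every fixture that names an exercise absent from the current catalog.
function findDrift(fixtures: Fixture[], catalog: string[]): Drift[] {
  const drift: Drift[] = [];
  for (const fixture of fixtures) {
    const missing = fixture.exerciseNames.filter((name) => catalog.indexOf(name) === -1);
    if (missing.length > 0) drift.push({ fixtureId: fixture.id, missing });
  }
  return drift;
}

// Fail the run loudly rather than grade against stale fixtures.
function assertNoDrift(fixtures: Fixture[], catalog: string[]): void {
  const drift = findDrift(fixtures, catalog);
  if (drift.length > 0) {
    const detail = drift
      .map((d) => `${d.fixtureId}: ${d.missing.join(", ")}`)
      .join("; ");
    throw new Error(`Fixture drift detected: ${detail}`);
  }
}
```

Running `assertNoDrift` before any grading keeps a renamed or removed exercise from showing up later as a confusing quality failure.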
None of that is an argument against evals. It is an argument for treating the eval as a system you maintain, not a verdict you trust blindly. The honest next step here is level three: an A/B test that measures whether a higher-rated plan actually changes adherence, plus a model-as-judge pass for the nuance the deterministic checks miss.
An eval is a habit, not a one-time check
The most important shift is treating the eval as something you run constantly, not once. An AI feature is never finished. There is always another failure mode you have not encoded, another corner case a real user will find, another prompt tweak that quietly improves one thing and breaks another. That last part is the dangerous one. LLM output is non-deterministic and tightly coupled to the prompt, so a change that looks harmless, a reworded instruction, a new field, a model bump, can silently undo a fix you made weeks ago.
That is why the cheapest checks run on every commit. Once the prompt fix closed the targetReps bug, the validation check became a guard: any future change that reintroduces the bug fails immediately, instead of shipping and surfacing as a broken plan for a real user. The expensive quality and A/B layers run less often, but the deterministic floor is fast and free enough to run every single time, and it should. You improve an AI feature the way you improve any system, in small consistent steps, each one protected by a check that confirms the last improvement still holds. The goal is not to reach a perfect score once. It is to never quietly go backwards.
Why this matters if you build on LLMs
The general lessons travel well beyond workout plans:
- You cannot improve what you cannot measure, and LLM output is non-deterministic, so manual spot-checks give false confidence.
- Evals localize bugs. Pinning a failure to one fixture and one field pointed straight at the fix.
- Prompt beats model size more often than you expect. The highest-return fix was free, and the bigger model would have cost more and still failed.
- Model selection becomes data-driven: cost, latency, and quality on identical inputs, not a hunch that bigger is better.
- Cheap deterministic checks on cached output let you iterate dozens of times at no extra cost instead of re-running expensive generations.
- Realistic, varied fixtures are essential, because a single happy-path profile hides the failures the corner cases expose immediately.
Before the harness, I was eyeballing a handful of plans and trusting my gut on the model. After it, I had a bug fixed for free that no larger model could fix, a check I corrected before it misled me, and a model choice backed by dollars and milliseconds. The score is not the point. The point is that every decision now has a reason behind it that I can show you, which is the entire difference between testing an AI feature and hoping it works.
