Evaluation Is the Biggest Unsolved Problem in AI Engineering

Q: Why do my eval scores look great while users complain?

Usually some combination of eval drift (your suite samples an outdated question distribution - often outdated because your own success pushed users toward harder questions), institutional overfitting (months of merge decisions conditioned on the same cases), and judge bias toward fluent answers. Check each: compare suite mix against a fresh production sample, score a virgin holdout, and audit judge grades against experts with source documents.

Q: What's wrong with LLM-as-judge evaluation?

Judges systematically over-pass fluent, confident, well-structured wrongness - exactly the most dangerous production failure - because they grade surface signals of competence. Worse, using a judge as a merge criterion breeds your system toward the judge's blind spots. Anchor correctness to artifacts (citations contain claims, numbers match tables) and reserve judges for properties they can actually see.

Q: What is institutional overfitting?

Overfitting performed by an organization instead of an optimizer: hundreds of accept/reject engineering decisions, each conditioned on the same eval set, cumulatively fit the system to those cases. No one "trained on test," yet the effect is identical. Detect it with a virgin holdout no decision ever touches - the gap between holdout and main-suite scores is your overfit.

Q: How do I evaluate long-term properties like memory or drift resistance?

Longitudinally, with dedicated probes: cohort curves for whether recurring task types improve over deployment time, scheduled correction tests (does a corrected fact ever resurface?), and drift probes (facts you know changed, re-asked monthly). Static suites structurally cannot measure time-derivative properties - budget for slow instruments.

Q: Are paired comparisons really more reliable than absolute scores?

Yes, substantially. Absolute scores carry every instrument bias at full strength; paired diffs under controlled change cancel shared bias, the way twin studies cancel shared genetics. Treat absolute numbers as reporting, and trust only differences for engineering decisions - ablations, swap tests, and regressions all live safely in diff-space.

Q: What additional detail does the "A Confession About Nine Essays" section cover?

The ablation that found 40% of our prompt was dead weight - measured by the eval suite. The swap test that valued architecture above a model generation - scored by the eval suite. The plateau at 81, the recovery to the low nineties, the 1.8-point downgrade that didn't hurt - all of it, numbers from the same instrument.

Q: What does this essay explain about the quarter the number lied?

The incident that earned evaluation its place as the series finale. Two quarters after the swap test, our dashboard was the best it had ever been: 91.2% and stable. The same quarter, customer escalations rose 40%.

Q: What additional detail does the "The Quarter the Number Lied" section cover?

Two quarters after the swap test, our dashboard was the best it had ever been: 91.2% and stable. The same quarter, customer escalations rose 40%.

Q: What does "The questions had moved" mean in this essay?

Our suite was built from a snapshot of production questions, curated fourteen months earlier.

Q: What does "The judge was grading the wrong property" mean in this essay?

A third of our suite used an LLM judge for open-ended answers.

Why do my eval scores look great while users complain?

What's wrong with LLM-as-judge evaluation?

What is institutional overfitting?

How do I evaluate long-term properties like memory or drift resistance?

Are paired comparisons really more reliable than absolute scores?

What does this essay explain about a confession about nine essays?

What additional detail does the "A Confession About Nine Essays" section cover?

What does this essay explain about the quarter the number lied?

What additional detail does the "The Quarter the Number Lied" section cover?

What does "The questions had moved" mean in this essay?

What does "The judge was grading the wrong property" mean in this essay?

What does this essay explain about testing tools, measurement problems?

What additional detail does the "Testing Tools, Measurement Problems" section cover?

What key claim does the essay make about testing tools, measurement problems?

What does this essay explain about the four ways an eval rots?

What additional detail does the "The Four Ways an Eval Rots" section cover?

What does "Eval drift is self-inflicted by success" mean in this essay?

What does "Saturation is mistaken for excellence" mean in this essay?

What does this essay explain about the fluency confound?

What additional detail does the "The Fluency Confound" section cover?

How should readers understand: The failure mode we most need to catch is the one optimized to fool the instrument we're catching it with?

What does this essay explain about measurement inversion?

What additional detail does the "Measurement Inversion" section cover?

How should readers understand: Measurement inversion: the value of a system property and the ease of measuring it are inversely correlated - and investment follows measurability, not value?

What Partially Works?

What additional detail does the "What Partially Works" section cover?

What does "Refresh the sampling frame on a schedule" mean in this essay?

What does "Keep a virgin holdout" mean in this essay?

What does "Grade against artifacts, not impressions" mean in this essay?

What does "Audit failures before averaging them" mean in this essay?

What does "Run paired diffs, not absolute scores" mean in this essay?

What does "Measure the longitudinal properties longitudinally" mean in this essay?

What are the limits of the argument presented in this essay?

What additional detail does the "Where This Argument Breaks" section cover?

What does ""Unsolved" overclaims for narrow tasks" mean in this essay?

What does "Production metrics are a real substitute - sometimes" mean in this essay?

What does "Judges will improve" mean in this essay?

What does "Perfect measurement isn't the goal" mean in this essay?

What does this essay explain about the question the whole series ends on?

What additional detail does the "The Question the Whole Series Ends On" section cover?

What should readers take away about ai evaluation is measurement science performed with testing tools: evals estimate latent properties of unenumerable distributions from finite samples with unreliable graders?

What should readers take away about eval suites rot four ways - drift (your own success moves the question distribution)?

What should readers take away about llm judges carry a structural fluency confound: they over-pass confident?

What should readers take away about measurement inversion governs the industry: the most valuable properties (calibration?

What should readers take away about partial remedies: refresh sampling frames monthly?

What should readers take away about the discipline-defining work of ai engineering is the same as every prior field's: constructing the instruments comes before trusting the progress?

What is Evaluation Is the Biggest Unsolved Problem in AI Engineering about?

Which series is Evaluation Is the Biggest Unsolved Problem in AI Engineering part of?

How does Evaluation Is the Biggest Unsolved Problem in AI Engineering relate to AI evals?

How does Evaluation Is the Biggest Unsolved Problem in AI Engineering relate to LLM as judge problems?

How does Evaluation Is the Biggest Unsolved Problem in AI Engineering relate to eval set design?

How does Evaluation Is the Biggest Unsolved Problem in AI Engineering relate to Goodhart's law AI?

How does Evaluation Is the Biggest Unsolved Problem in AI Engineering relate to benchmark overfitting?

How does Evaluation Is the Biggest Unsolved Problem in AI Engineering relate to RAG evaluation?

How does Evaluation Is the Biggest Unsolved Problem in AI Engineering relate to agent evaluation?

What is this writing piece about?

Who should read Evaluation Is the Biggest Unsolved Problem in AI Engineering?

What are the key takeaways from Evaluation Is the Biggest Unsolved Problem in AI Engineering?

How does Evaluation Is the Biggest Unsolved Problem in AI Engineering relate to Dhruvil Patel's work?

Evaluation Is the Biggest Unsolved Problem in AI Engineering

Key Takeaways

FAQ

Key Takeaways

FAQ

Related Expertise

Related Concepts

Related Research

Related Articles

Related Automations