Evaluation Is the Biggest Unsolved Problem in AI Engineering
We're doing psychometrics with unit-test tools. Why AI evals drift, saturate, and grade fluency instead of truth - and what partially works in production.
Key Takeaways
- AI evaluation is measurement science performed with testing tools: evals estimate latent properties of unenumerable distributions from finite samples with unreliable graders, yet teams build and trust them like deterministic test suites.
- Eval suites rot four ways - drift (your own success moves the question distribution), institutional overfitting (organizational decisions repeatedly conditioned on the same cases amount to training on the test set), construct slippage (the metric stops tracking the property it was meant to measure), and saturation mistaken for excellence.
- LLM judges carry a structural fluency confound: they over-pass confident, well-textured wrongness - the exact failure mode that matters most - and using them as merge criteria breeds systems toward their blind spots.
- Measurement inversion governs the industry: the most valuable properties (calibration, drift resistance, correction acceptance, compounding) are longitudinal and hardest to measure, so investment flows to cheap static snapshots instead.
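To make the first takeaway concrete, here is a minimal sketch of treating an eval score as an estimate rather than a test result. It is an illustration, not a method from this article: it uses the standard Wilson score interval for the sampling uncertainty of a pass rate, and the Rogan-Gladen correction (borrowed from diagnostic testing) to adjust for a grader that sometimes passes wrong answers. The function names, and the assumption that you have measured the judge's sensitivity and specificity against human labels, are hypothetical.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate observed on n sampled cases."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def corrected_pass_rate(observed: float, sensitivity: float, specificity: float) -> float:
    """Rogan-Gladen estimate of the true pass rate given an imperfect judge.

    sensitivity: P(judge passes | answer is actually correct)
    specificity: P(judge fails  | answer is actually wrong)
    Both are assumed to come from a human-labelled audit sample.
    """
    denom = sensitivity + specificity - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction undefined")
    estimate = (observed + specificity - 1) / denom
    return min(1.0, max(0.0, estimate))  # clamp to a valid proportion


if __name__ == "__main__":
    # 90 of 100 cases passed according to an LLM judge.
    low, high = wilson_interval(90, 100)
    print(f"observed pass rate 0.90, 95% interval ({low:.3f}, {high:.3f})")

    # Hypothetical audit: the judge rarely fails correct answers (0.98)
    # but passes 40% of fluent wrong ones (specificity 0.60).
    print(f"corrected estimate {corrected_pass_rate(0.90, 0.98, 0.60):.3f}")
```

With only 100 cases the interval is already several points wide, and a judge that over-passes fluent wrong answers pushes the corrected estimate below the headline number. Neither effect is visible if the suite is read as a pass/fail gate.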