A set of test cases and scoring criteria used to evaluate AI model or prompt performance.
An eval benchmark is a fixed, shared set of tasks and a scoring rule, used to measure what a model can do and to compare models on the same footing. The discipline rests on one fragile assumption: that the model has never seen the test before. When that assumption breaks, the number still looks precise, and that is exactly the danger.
Shared benchmarks predate large language models, but the modern lineage starts with GLUE in 2018, which bundled nine natural-language-understanding tasks into one score. Models caught up within a year, so SuperGLUE arrived in 2019 with harder tasks and a stronger human baseline. The pattern repeats across the field: a benchmark is published, the frontier climbs, the benchmark saturates, a tougher one replaces it.
MMLU, released by Dan Hendrycks and colleagues on 7 September 2020, was a direct response to that saturation. It posed multiple-choice questions across 57 subjects, from elementary mathematics to law, to test breadth rather than a single skill. It became one of the most cited benchmarks in the field, with over 100 million dataset downloads by mid-2024.
Standardisation became its own concern. Stanford's HELM project argued that a benchmark is only meaningful if the prompts, the scoring, and the raw predictions are held constant and published. Two labs running "MMLU" with different prompt formats produce different numbers, so HELM fixed the harness and exposed every prompt.
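The idea of a fixed harness can be sketched in a few lines. This is a minimal illustration, not HELM's actual code or API: the template text, the `render_prompt` and `run_item` helpers, and the stub model are all hypothetical. What it shows is the principle of one frozen prompt format, one scoring rule, and a record that keeps the exact prompt and raw prediction for publication.

```python
# Hypothetical frozen template: every model sees exactly this format.
PROMPT_TEMPLATE = "Question: {question}\nChoices: {choices}\nAnswer:"


def render_prompt(question: str, choices: list[str]) -> str:
    """Render one multiple-choice item with lettered options (A, B, C, ...)."""
    labelled = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return PROMPT_TEMPLATE.format(question=question, choices=labelled)


def run_item(model, item: dict) -> dict:
    """Score one item by exact match and keep the prompt and raw prediction."""
    prompt = render_prompt(item["question"], item["choices"])
    prediction = model(prompt).strip()
    return {
        "prompt": prompt,
        "prediction": prediction,
        "gold": item["answer"],
        "correct": prediction == item["answer"],
    }


def always_a(prompt: str) -> str:
    """Stub standing in for a real model call; it always answers 'A'."""
    return "A"


item = {"question": "What is 1 + 2?", "choices": ["3", "4"], "answer": "A"}
record = run_item(always_a, item)
```

Because `record` carries the rendered prompt alongside the prediction, two labs can compare outputs item by item and check that they ran the same test.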
Then came the contamination reckoning. Because public benchmarks leak into pretraining corpora, a high score can reflect memorisation. Scale AI built GSM1k, a held-out mirror of the GSM8K grade-school maths set, and found accuracy drops of up to 8% on some model families, with the Phi and Mistral families showing systematic overfitting while frontier models held steady. They released only 50 of the 1,000 problems to keep the rest clean.
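The comparison behind a held-out mirror is simple arithmetic: score the model on the public set and on the unseen set, then look at the difference. The sketch below assumes hypothetical counts (they are not figures from the GSM1k study) and hypothetical helper names.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of items answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def contamination_gap(public_acc: float, heldout_acc: float) -> float:
    """Percentage-point drop from the public set to the held-out mirror.

    A large positive gap hints at memorisation; it is not proof on its own.
    """
    return round((public_acc - heldout_acc) * 100, 1)


# Hypothetical counts for illustration only.
public = accuracy(correct=820, total=1000)
heldout = accuracy(correct=740, total=1000)
gap = contamination_gap(public, heldout)
```

A gap near zero suggests the public score reflects real ability; a gap of several points is a reason to distrust the public number for that model.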
A team picks Claude for a customer-support assistant after seeing it lead an MMLU leaderboard. In production, the assistant keeps misreading refund-policy edge cases. The public benchmark measured broad academic knowledge; it never measured this product's refund logic. So the team builds a task-specific eval benchmark: 300 real support tickets, each with a graded correct resolution, scored on a rubric the support leads agree on. The first run scores 71%. The MMLU number, impressive as it was, predicted almost nothing about the 29% that fail here. The gap between a benchmark score and production behaviour is the thing the team now manages directly.
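A task-specific benchmark like this one reduces to a list of cases, a scoring rule, and a threshold. The following is a minimal sketch under stated assumptions: the rubric is simplified to an exact match on a resolution label, and the `TicketCase` type, the label names, and the 0.85 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TicketCase:
    """One benchmark item: a real ticket plus its agreed resolution label."""
    ticket_id: str
    expected_resolution: str


def score_run(cases: list[TicketCase], predictions: dict[str, str]) -> tuple[float, list[str]]:
    """Return (pass rate, ids of failing tickets) for one benchmark run."""
    if not cases:
        raise ValueError("benchmark has no test cases")
    failures = [
        c.ticket_id
        for c in cases
        if predictions.get(c.ticket_id) != c.expected_resolution
    ]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures


# Hypothetical three-ticket suite; the suite described above holds 300.
cases = [
    TicketCase("T-1", "refund_full"),
    TicketCase("T-2", "refund_denied"),
    TicketCase("T-3", "escalate"),
]
predictions = {"T-1": "refund_full", "T-2": "refund_full", "T-3": "escalate"}

pass_rate, failures = score_run(cases, predictions)
PASSING_THRESHOLD = 0.85  # assumed value; the team would set its own
passed = pass_rate >= PASSING_THRESHOLD
```

Returning the failing ticket ids alongside the pass rate matters more than the headline number: the failures are the cases the team reads and fixes.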
In the Unified Product Graph, Eval Benchmark (eval_benchmark) lives in the ai domain, where AI Model (ai_model) is the anchor. It sits at the centre of the Model Evaluation Loop pattern: a model is benchmarked by an Eval Benchmark (ai_model_benchmarked_by_eval_benchmark, a hierarchy edge), that benchmark is executed as an Eval Run each time it runs (eval_benchmark_executed_as_eval_run, a hierarchy edge), and the benchmark connects through eval_benchmark_measures_feature (a cross-domain edge to Feature) and eval_benchmark_defines_metric (a hierarchy edge to Metric) to the product capability under test and the metric that scores it. That structure encodes the domain's first anti-pattern, models without evaluation, as a visible gap: a deployed model with no benchmark edge is queryably untested.
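The "queryably untested" check can be sketched as a set difference over edges. This is an illustration only, not the Unified Product Graph's query interface: the tuple representation of edges, the node ids, and the "deployed" status value are assumptions. Only the edge type names come from the text above.

```python
BENCHMARKED_BY = "ai_model_benchmarked_by_eval_benchmark"


def untested_models(
    models: dict[str, str],
    edges: list[tuple[str, str, str]],
) -> list[str]:
    """Return ids of deployed models with no outgoing benchmark edge.

    models maps a model id to its status; edges are
    (source id, edge type, target id) tuples.
    """
    benchmarked = {src for src, edge_type, _ in edges if edge_type == BENCHMARKED_BY}
    return sorted(
        model_id
        for model_id, status in models.items()
        if status == "deployed" and model_id not in benchmarked
    )


# Hypothetical graph fragment.
models = {"model-a": "deployed", "model-b": "deployed", "model-c": "draft"}
edges = [
    ("model-a", BENCHMARKED_BY, "bench-1"),
    ("bench-1", "eval_benchmark_executed_as_eval_run", "run-1"),
]

gaps = untested_models(models, edges)
```

Here `model-b` is deployed but has no benchmark edge, so it is the one the check surfaces; `model-c` is ignored because it is not deployed.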
Type-specific fields on BaseNode
benchmark_type (string): Measured dimension
test_case_count (number): Test cases in the suite
passing_threshold (number): Minimum passing score
last_run (string): ISO date of the most recent run

Fields inherited from BaseNode
id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
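The field list maps naturally onto a typed record. The sketch below is one possible Python rendering, not the graph's own schema code: it assumes that `number` means an integer for test_case_count and a float for passing_threshold, that NodeType can be represented as a plain string, and that a small `validate` helper (hypothetical) is wanted. The example values are made up.

```python
import uuid
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class EvalBenchmark:
    # Required BaseNode fields.
    id: str
    type: str  # NodeType discriminator, modelled here as a plain string
    title: str
    # Optional BaseNode fields.
    description: Optional[str] = None
    status: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    # Type-specific fields.
    benchmark_type: Optional[str] = None
    test_case_count: Optional[int] = None
    passing_threshold: Optional[float] = None
    last_run: Optional[str] = None  # ISO date string, e.g. "2024-06-01"

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the node looks sane."""
        problems: list[str] = []
        if self.test_case_count is not None and self.test_case_count < 0:
            problems.append("test_case_count must be non-negative")
        if self.last_run is not None:
            try:
                date.fromisoformat(self.last_run)
            except ValueError:
                problems.append("last_run must be an ISO date")
        return problems


# Hypothetical instance for the support-ticket suite.
bench = EvalBenchmark(
    id=str(uuid.uuid4()),
    type="eval_benchmark",
    title="Refund-policy support tickets",
    status="drafted",
    benchmark_type="task_accuracy",
    test_case_count=300,
    passing_threshold=0.85,
    last_run="2024-06-01",
)
problems = bench.validate()
```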
Lifecycle: 4 phases; initial phase: drafted
4 edge types connected to this entity.
ai_model_benchmarked_by_eval_benchmark
eval_benchmark_measures_feature