A single execution of an evaluation benchmark against a specific model/prompt combination, with scored results.
An eval run is a single execution of an eval against one model and one prompt version, captured with enough detail to compare against the next run. A benchmark is a fixed instrument; the run is one reading from it, stamped with a date, a model build, and a score. Treating runs as disposable is how teams lose the ability to tell whether they improved or regressed.
The run-as-record discipline borrows directly from software testing. A test suite is the fixed thing; a test run is one invocation against one build, and continuous-integration culture taught teams to store every run so they can see the moment a green suite turned red. Applied to models, the same idea produces the offline eval: a suite of inputs with expected behaviour, executed before anything ships.
The hard part is scoring open-ended generations, where there is no single correct string. The answer that scaled was LLM-as-a-judge, formalised by Lianmin Zheng and colleagues in "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (2023). They showed that a strong judge model such as GPT-4 agrees with human preferences over 80% of the time, matching the agreement rate between humans themselves. That made automated, repeatable eval runs practical for subjective quality, and the same paper named the position and verbosity biases that a careless run inherits from its judge.
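One mitigation for position bias discussed in that paper is to query the judge twice with the answer order swapped and to count a win only when both orderings agree. The following is a minimal sketch of that idea, not a reference implementation: the `judge` callable, its "A"/"B"/"tie" return convention, and the two stub judges are all hypothetical stand-ins for a real model call.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "A", "B", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, question: str, answer_1: str, answer_2: str) -> str:
    """Ask the judge in both orders; return "1", "2", or "tie"."""
    first = judge(question, answer_1, answer_2)   # answer_1 shown in position A
    second = judge(question, answer_2, answer_1)  # answer_1 shown in position B

    if first == "A" and second == "B":
        return "1"  # answer_1 preferred in both orders
    if first == "B" and second == "A":
        return "2"  # answer_2 preferred in both orders
    return "tie"    # inconsistent or tied verdicts are treated as a tie


def always_first(question: str, a: str, b: str) -> str:
    """Stub judge with maximal position bias: always prefers whatever is shown first."""
    return "A"


def prefers_shorter(question: str, a: str, b: str) -> str:
    """Stub judge that prefers the shorter answer regardless of position."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) < len(b) else "B"
```

With the position-biased stub, the swapped query contradicts the first one, so the pair is recorded as a tie instead of a spurious win.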
Practice then folded eval runs into the deployment pipeline. The widely repeated rule in modern AI engineering is that prompts are code, which makes a regression eval the natural gate: run the suite on every prompt or model change, block the change if the score drops. The run stops being a one-off experiment and becomes a recurring check, like a unit-test job.
A team ships a summarisation feature with a 120-example offline eval, each example paired with a reference summary and a judge rubric scoring faithfulness and concision. The baseline eval run on prompt version 4 scores 8.1 out of 10. An engineer rewrites the system prompt to cut verbosity and opens a pull request. The CI regression eval fires automatically: prompt version 5 scores 8.4 on concision but 7.2 on faithfulness, because the tighter prompt started dropping caveats. Two stored runs, side by side, make the trade-off legible, and the change is held until the faithfulness drop is fixed. Without the recorded baseline run, the regression would have reached users as a quiet quality decline.
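The gate in this scenario could be sketched roughly as follows. The `StoredRun` type, the `regressions` helper, and the 0.2 tolerance are hypothetical. The version 5 scores come from the scenario, but the per-metric baseline values for version 4 are assumed for illustration, since the text only gives its aggregate of 8.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredRun:
    """One recorded eval run: which prompt version, and its per-metric scores."""
    prompt_version: int
    scores: dict[str, float]  # metric name -> mean judge score (0-10)


def regressions(baseline: StoredRun, candidate: StoredRun, tolerance: float = 0.2) -> dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`, mapped to the drop size."""
    drops = {}
    for metric, base_score in baseline.scores.items():
        # A metric missing from the candidate run counts as a full drop.
        drop = base_score - candidate.scores.get(metric, 0.0)
        if drop > tolerance:
            drops[metric] = round(drop, 2)
    return drops


# ASSUMED per-metric baseline values (only the 8.1 aggregate is given in the text).
baseline = StoredRun(prompt_version=4, scores={"faithfulness": 8.3, "concision": 7.9})
# Candidate scores as stated in the scenario.
candidate = StoredRun(prompt_version=5, scores={"faithfulness": 7.2, "concision": 8.4})

failed = regressions(baseline, candidate)
gate_passed = not failed  # CI would block the pull request when this is False
```

Gating per metric rather than on one aggregate is a design choice: an averaged score can hide a faithfulness drop behind a concision gain, which is the trade-off the two stored runs expose.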
In the Unified Product Graph, Eval Run (eval_run) lives in the ai domain alongside AI Model (ai_model) and Eval Benchmark (eval_benchmark). Its defining edges are eval_benchmark_executed_as_eval_run (Eval Benchmark executed as Eval Run, a hierarchy edge), which ties every run back to the fixed benchmark that produced it, and eval_run_produces_metric (Eval Run produces Metric, a causal edge), which records the scores that run emitted. Modelling the run as its own entity, distinct from the benchmark, is what lets the graph hold a benchmark's full history: every execution, against every model build, over time. That history is the raw material for the Model Evaluation Loop, where models are compared across runs rather than on a single snapshot.
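To illustrate why a separate run entity makes a benchmark's history queryable, here is a small in-memory sketch. Only the two edge type names come from the text. The `ProductGraph` class, the node ids, the dates, the metric `value` field, and the scores are hypothetical and say nothing about how the Unified Product Graph is actually stored.

```python
from collections import defaultdict

# Edge type names are taken from the text; everything else here is hypothetical.
EXECUTED_AS = "eval_benchmark_executed_as_eval_run"
PRODUCES = "eval_run_produces_metric"


class ProductGraph:
    """Tiny in-memory graph: nodes keyed by id, edges grouped by (source, edge_type)."""

    def __init__(self) -> None:
        self.nodes: dict[str, dict] = {}
        self.edges: dict[tuple[str, str], list[str]] = defaultdict(list)

    def add_node(self, node_id: str, node_type: str, **fields) -> None:
        self.nodes[node_id] = {"type": node_type, **fields}

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        self.edges[(source, edge_type)].append(target)

    def benchmark_history(self, benchmark_id: str) -> list[tuple[str, list[float]]]:
        """Every run of a benchmark, oldest first, with the metric values it produced."""
        run_ids = self.edges.get((benchmark_id, EXECUTED_AS), [])
        runs = sorted(run_ids, key=lambda r: self.nodes[r]["run_date"])
        return [
            (run_id, [self.nodes[m]["value"] for m in self.edges.get((run_id, PRODUCES), [])])
            for run_id in runs
        ]


# Illustrative data only: ids, dates, and scores are made up.
graph = ProductGraph()
graph.add_node("bench-1", "eval_benchmark")
for run_id, run_date, score in [("run-2", "2024-02-01", 7.8), ("run-1", "2024-01-01", 8.1)]:
    graph.add_node(run_id, "eval_run", run_date=run_date)
    graph.add_node(f"{run_id}-score", "metric", value=score)
    graph.add_edge("bench-1", EXECUTED_AS, run_id)
    graph.add_edge(run_id, PRODUCES, f"{run_id}-score")

history = graph.benchmark_history("bench-1")
```

Because each run is its own node, the history is a plain edge traversal from the benchmark; nothing has to be overwritten when a new run arrives.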
Type-specific fields on BaseNode
run_date (string): ISO date executed
score (number): Aggregate score
passed (boolean): Whether the passing threshold was met
duration_ms (number): Wall-clock duration (ms)
token_count (number): Total tokens consumed
input_token_count (number): Input tokens
output_token_count (number): Output tokens
cost (number): Total run cost
error_rate (number): Percentage of test cases that errored
feedback_scores (string): Feedback score summary (human or automated)
id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
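The fields listed above could be mirrored in a typed record along these lines. This is a sketch under assumptions, not the graph's actual schema: the `EvalRun` class name, the "eval_run" discriminator value, treating every non-required field as optional, reading run_date as a YYYY-MM-DD calendar date, and the 0 to 100 range check on error_rate are all guesses drawn from the field descriptions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class EvalRun:
    # Base fields marked "required" in the listing
    id: str
    title: str
    type: str = "eval_run"  # NodeType discriminator; the literal value is an assumption
    # Optional base fields
    description: Optional[str] = None
    status: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    # Type-specific fields (assumed optional, since none is marked required)
    run_date: Optional[str] = None
    score: Optional[float] = None
    passed: Optional[bool] = None
    duration_ms: Optional[float] = None
    token_count: Optional[float] = None
    input_token_count: Optional[float] = None
    output_token_count: Optional[float] = None
    cost: Optional[float] = None
    error_rate: Optional[float] = None
    feedback_scores: Optional[str] = None

    def __post_init__(self) -> None:
        if self.run_date is not None:
            # Assumes a calendar date such as "2024-01-15"; raises ValueError otherwise.
            date.fromisoformat(self.run_date)
        if self.error_rate is not None and not 0 <= self.error_rate <= 100:
            raise ValueError("error_rate is a percentage and must be between 0 and 100")


# Illustrative values only.
run = EvalRun(
    id="00000000-0000-0000-0000-000000000005",
    title="Summarisation eval, prompt version 5",
    run_date="2024-01-15",
    score=7.8,
    passed=False,
    error_rate=0.0,
)
```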
4 phases (initial: planned)
2 edge types connected to this entity.
eval_benchmark_executed_as_eval_run
eval_run_produces_metric