What is the purpose of an Eval Run?

An eval run is one reading from a benchmark, stamped with a date, a model build, and a score. It supplies the evidence that a prompt or model change improved or degraded quality, which makes it the basic experiment of AI development.

How do you use an Eval Run in product management?

Run evals automatically in CI when prompts change. Store results with the prompt version and model version. Compare runs over time to track quality trends.

Where does the concept of an Eval Run come from?

An eval run (a single scored execution of a model or prompt against a benchmark) is the LLM-era counterpart of a test-suite run, made routine by 2023–2024 evaluation tooling such as OpenAI Evals, LangSmith, and Weights & Biases. It has no single inventor; the pattern of capturing per-run scores so configurations can be compared over time is the shared practice these tools standardised.

What are common mistakes with an Eval Run?

Reporting an aggregate score with no breakdown by category hides which cases regressed behind a comfortable average. Running evals without pinning the model version, prompt and dataset makes results impossible to compare or reproduce. Treating a single run as ground truth ignores the variance inherent in non-deterministic models. Acting only on the headline number, with no review of the actual failing transcripts, fixes the metric rather than the behaviour.

🎟️

Evaluation Run

Q: What is an Eval Run?

A single execution of an evaluation against one model and one prompt version, captured with enough detail to compare against the next run.

Q: What is an example of an Eval Run?

Pre-deploy gate run: An eval run scores a new prompt version against the intent-accuracy benchmark: 91.4%, above the 90% bar, so the change is cleared to ship.

A single execution of an evaluation benchmark against a specific model/prompt combination, with scored results.

AI & Machine Learning · Engineering & Platform · type: 'eval_run' · interface: BaseNode


Description

An eval run is a single execution of an eval against one model and one prompt version, captured with enough detail to compare against the next run. A benchmark is a fixed instrument; the run is one reading from it, stamped with a date, a model build, and a score.


Origin & evolution

The run-as-record discipline borrows directly from software testing. A test suite is the fixed thing; a test run is one invocation against one build, and continuous-integration culture taught teams to store every run so they can see the moment a green suite turned red. Applied to models, the same idea produces the offline eval: a suite of inputs with expected behaviour, executed before anything ships.

The hard part is scoring open-ended generations, where there is no single correct string. The answer that scaled was LLM-as-a-judge, formalised by Lianmin Zheng and colleagues in Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (2023). They showed that a strong judge model such as GPT-4 agrees with human preferences over 80% of the time, matching the agreement rate between humans themselves. That made automated, repeatable eval runs practical for subjective quality, and the same paper named the position and verbosity biases that a careless run inherits from its judge.
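The position bias named in that paper can be checked mechanically: ask the judge twice with the answer order swapped and accept a verdict only when both orderings agree. The following is a minimal sketch under stated assumptions, not the paper's implementation. The `Judge` signature and both stub judges are hypothetical stand-ins for a real judge-model call.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (question, answer_a, answer_b) -> "A" or "B".
# In a real run this would wrap a call to a judge model; here it is stubbed.
Judge = Callable[[str, str, str], str]


def consistent_verdict(judge: Judge, question: str, ans_1: str, ans_2: str) -> Optional[str]:
    """Return "1" or "2" if the judge picks the same answer in both orders, else None."""
    first = judge(question, ans_1, ans_2)   # ans_1 shown in slot A
    second = judge(question, ans_2, ans_1)  # ans_1 shown in slot B
    winner_first = "1" if first == "A" else "2"
    winner_second = "2" if second == "A" else "1"
    return winner_first if winner_first == winner_second else None


def first_slot_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Deliberately position-biased stub: always prefers whatever is shown first."""
    return "A"


def longer_answer_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Order-independent stub that prefers the longer answer (illustration only)."""
    return "A" if len(answer_a) >= len(answer_b) else "B"
```

A verdict of `None` marks the comparison as inconclusive rather than counting it for either answer. The second stub passes the order check but is itself verbosity-biased, which is a reminder that the swap test covers only one of the two biases.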

Chip Huyen's *Designing Machine Learning Systems* (O'Reilly, 2022) codifies what a run record should contain and why the discipline matters: she distinguishes experiment *tracking* — capturing loss curves, per-split metrics, inference speed, and sample-level predictions — from experiment *versioning*, which logs the configuration parameters that would be needed to reproduce the run. By that framing, an eval run is the intersection of both: a timestamped execution whose score is only meaningful if the model build and prompt config that produced it are stored alongside it. Huyen also argues that aggregate accuracy alone is insufficient for production readiness; she advocates perturbation testing (checking whether small input changes cause disproportionate score shifts) and slice-based evaluation (decomposing the aggregate score by data subgroup) as the tests that reveal whether a stored run's headline number is trustworthy.
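Slice-based evaluation is simple to sketch: group per-example results by subgroup and report a score for each group next to the aggregate. The example below is illustrative only. The slice names and pass/fail data are made up, and `accuracy_by_slice` is a hypothetical helper, not an API from the book or any tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_slice(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each slice name to the fraction of its examples that passed."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for slice_name, ok in results:
        total[slice_name] += 1
        passed[slice_name] += int(ok)
    return {name: passed[name] / total[name] for name in total}


# Illustrative data: one (slice, passed) pair per evaluated example.
results = (
    [("billing", True)] * 9 + [("billing", False)] * 1
    + [("refunds", True)] * 3 + [("refunds", False)] * 2
)
overall = sum(ok for _, ok in results) / len(results)
per_slice = accuracy_by_slice(results)
```

With this data the aggregate comes out at 80% while the smaller slice sits at 60%, which is the kind of gap a headline number hides.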

Practice then folded eval runs into the deployment pipeline. The widely repeated rule in modern AI engineering is that prompts are code, which makes a regression eval the natural gate: run the suite on every prompt or model change, block the change if the score drops. The run stops being a one-off experiment and becomes a recurring check, like a unit-test job.
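A regression gate of this kind can be sketched as a comparison between a stored baseline run and the candidate run, returning a process exit code that CI can act on. This is a minimal sketch under assumptions: the function names, the `max_drop` tolerance, and the metric values are all hypothetical.

```python
from typing import Dict, List


def regressed_metrics(baseline: Dict[str, float], candidate: Dict[str, float],
                      max_drop: float = 0.0) -> List[str]:
    """Return metrics on which the candidate fell more than max_drop below baseline.

    A metric missing from the candidate run counts as regressed.
    """
    failed = []
    for name, base_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None or base_score - new_score > max_drop:
            failed.append(name)
    return sorted(failed)


def gate(baseline: Dict[str, float], candidate: Dict[str, float],
         max_drop: float = 0.0) -> int:
    """Return an exit code: 0 lets the change through, 1 blocks it."""
    failed = regressed_metrics(baseline, candidate, max_drop)
    for name in failed:
        print(f"regression on {name}: {baseline[name]} -> {candidate.get(name)}")
    return 1 if failed else 0


# Illustrative scores only; real values would come from stored eval runs.
baseline_run = {"accuracy": 0.90, "faithfulness": 0.88}
candidate_run = {"accuracy": 0.92, "faithfulness": 0.81}
```

In a CI job the result would typically be passed to `sys.exit`, so that a non-zero code fails the check. Gating per metric rather than on one aggregate means a gain on one metric cannot mask a drop on another.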

Huyen makes the production-gate logic explicit: she argues that the validation split, not the held-out test set, should bear the weight of iterative tuning, precisely so that the test run's score remains an uncontaminated reading against a fixed instrument. That separation maps directly onto the eval_run model: development runs accumulate against a validation set, while the release benchmark stays fixed and untouched by tuning, and a final run against it functions as the release gate.

How it works in practice

A team ships a summarisation feature with a 120-example offline eval, each example paired with a reference summary and a judge rubric scoring faithfulness and concision. The baseline eval run on prompt version 4 scores 8.1 out of 10. An engineer rewrites the system prompt to cut verbosity and opens a pull request. The CI regression eval fires automatically: prompt version 5 scores 8.4 on concision but 7.2 on faithfulness, because the tighter prompt started dropping caveats. Two stored runs, side by side, make the trade-off legible, and the change is held until the faithfulness drop is fixed. Without the recorded baseline run, the regression would have reached users as a quiet quality decline.

Eval run vs. its neighbours

Eval benchmark is the reusable instrument: the dataset plus the scoring rule, held constant. The run is one execution of that instrument. You author a benchmark once and accumulate many runs against it.
Metric is the number a run produces. A run is the event of measuring; the metric is the measurement. One run can emit several metrics at once, which is why they are separate entities.
AI experiment is a structured investigation that may compare several runs to test a hypothesis. A run is the atomic execution an experiment is built from; the experiment is the question, the runs are the readings.

In the graph

In the Unified Product Graph, eval_run lives in the ai domain alongside ai_model and eval_benchmark. Its defining edges are eval_benchmark_executed_as_eval_run, which ties every run back to the fixed benchmark that produced it, and eval_run_produces_metric, which records the scores that run emitted. Modelling the run as its own entity, distinct from the benchmark, is what lets the graph hold a benchmark's full history: every execution, against every model build, over time. That history is the raw material for the Model Evaluation Loop, where models are compared across runs rather than on a single snapshot.
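As a rough illustration of why the separate entity matters, a benchmark's history can be read off by following eval_benchmark_executed_as_eval_run edges and sorting the runs by date. The sketch assumes edges are stored as plain (source, edge type, target) tuples. That storage shape, the ids, and every value except the 2026-06-12 run are hypothetical.

```python
from typing import Dict, List, Tuple

# Edges as (source_id, edge_type, target_id); ids are illustrative.
edges: List[Tuple[str, str, str]] = [
    ("bench-1", "eval_benchmark_executed_as_eval_run", "run-a"),
    ("bench-1", "eval_benchmark_executed_as_eval_run", "run-b"),
    ("bench-2", "eval_benchmark_executed_as_eval_run", "run-c"),
    ("run-a", "eval_run_produces_metric", "metric-1"),
]
runs: Dict[str, dict] = {
    "run-a": {"run_date": "2026-06-12", "score": 88},
    "run-b": {"run_date": "2026-05-30", "score": 84},
    "run-c": {"run_date": "2026-06-01", "score": 71},
}


def benchmark_history(benchmark_id: str) -> List[Tuple[str, str, float]]:
    """List (run_date, run_id, score) for every run of a benchmark, oldest first."""
    run_ids = [
        dst for src, kind, dst in edges
        if src == benchmark_id and kind == "eval_benchmark_executed_as_eval_run"
    ]
    # ISO dates sort chronologically as plain strings.
    return sorted((runs[r]["run_date"], r, runs[r]["score"]) for r in run_ids)
```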

Preview


Eval Run

Build accuracy benchmark run 2026-06-12

Passed: true
Run date: 2026-06-12
Score: 88
Duration ms: 142000
Token count: 1840000
Input token count: 1420000
Output token count: 420000
Cost: 5.52
Error rate: 1.2
Feedback scores: 88% of 500 tool proposals rated usable without rework by human reviewers

Properties

Type-specific fields on BaseNode

run_date (string): ISO date executed
score (number): Aggregate score
passed (boolean): Whether the passing threshold was met
duration_ms (number): Wall-clock duration (ms)
token_count (number): Total tokens consumed
input_token_count (number): Input tokens
output_token_count (number): Output tokens
cost (number): Total run cost
error_rate (number): Percentage of test cases that errored
feedback_scores (string): Feedback score summary (human or automated)

Inherited from BaseNode (6 fields)

id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
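For readers who want to hold such a record in code, the property list can be mirrored as a dataclass. This is a sketch, not the graph's actual type definition. Treating every type-specific field as optional is an assumption (only id, type, and title are marked required), and so is the rule that token_count should equal input plus output tokens. The sample values come from the preview record.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalRun:
    """Sketch of an eval_run node: BaseNode fields plus the type-specific ones."""
    # Inherited from BaseNode (id, type, and title are required)
    id: str
    title: str
    type: str = "eval_run"
    description: Optional[str] = None
    status: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    # Type-specific fields
    run_date: Optional[str] = None       # ISO date executed
    score: Optional[float] = None        # aggregate score
    passed: Optional[bool] = None        # whether the passing threshold was met
    duration_ms: Optional[float] = None  # wall-clock duration
    token_count: Optional[int] = None
    input_token_count: Optional[int] = None
    output_token_count: Optional[int] = None
    cost: Optional[float] = None
    error_rate: Optional[float] = None   # percentage of test cases that errored
    feedback_scores: Optional[str] = None

    def token_counts_consistent(self) -> bool:
        """Assumption: total tokens should equal input plus output tokens."""
        parts = (self.token_count, self.input_token_count, self.output_token_count)
        if any(p is None for p in parts):
            return True  # nothing to check
        return self.token_count == self.input_token_count + self.output_token_count


run = EvalRun(
    id=str(uuid.uuid4()),
    title="Build accuracy benchmark run 2026-06-12",
    run_date="2026-06-12", score=88, passed=True, duration_ms=142000,
    token_count=1840000, input_token_count=1420000, output_token_count=420000,
    cost=5.52, error_rate=1.2,
)
```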

Lifecycle

5 phases, initial: planned · template: STUDY


Relationships

4 edge types connected to this entity.

Parents

Entities that can contain this type

Eval Benchmark: eval_benchmark_executed_as_eval_run

Cross-References

Contextual links across the graph

AI Model: eval_run_evaluates_ai_model
Prompt Version: eval_run_scores_prompt_version
Metric: eval_run_produces_metric

Graph Position

Eval Run: 1 parent, 3 cross-references

Definition

An evaluation run is a single execution of a benchmark against a specific model and prompt combination, with scored results recorded. It links back to its benchmark and produces the evidence that a prompt change improved or degraded quality.

Usage Guidance

Run evals automatically in CI when prompts change.
Store results with the prompt version and model version.
Compare runs over time to track quality trends.
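One way to make "store results with the prompt version and model version" concrete is to derive a short fingerprint from the pinned inputs and save it with every run. The sketch below is one possible approach, not a prescribed format. `run_fingerprint`, the choice of a 12-character SHA-256 prefix, and the version strings are all assumptions.

```python
import hashlib
import json


def run_fingerprint(model_version: str, prompt_version: str, dataset_version: str) -> str:
    """Stable short hash of the pinned inputs to a run."""
    pinned = {"model": model_version, "prompt": prompt_version, "dataset": dataset_version}
    # Sorted keys and fixed separators keep the hash stable across processes.
    canonical = json.dumps(pinned, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical version labels for illustration.
fp_v4 = run_fingerprint("model-build-a", "prompt-v4", "dataset-v1")
fp_v4_repeat = run_fingerprint("model-build-a", "prompt-v4", "dataset-v1")
fp_v5 = run_fingerprint("model-build-a", "prompt-v5", "dataset-v1")
```

Runs that share a fingerprint are repeats of one configuration, so any score difference between them is noise. Runs with different fingerprints reflect a real change in model, prompt, or dataset.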

Anti-Patterns

Reporting an aggregate score with no breakdown by category hides which cases regressed behind a comfortable average.
Running evals without pinning the model version, prompt and dataset makes results impossible to compare or reproduce.
Treating a single run as ground truth ignores the variance inherent in non-deterministic models.
Acting only on the headline number, with no review of the actual failing transcripts, fixes the metric rather than the behaviour.
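The single-run anti-pattern can be countered by repeating each configuration a few times and comparing the difference in means with the run-to-run spread. What follows is a crude rule of thumb for illustration, not a formal significance test. The threshold `k`, the function names, and all scores are assumptions.

```python
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated runs of one configuration."""
    return statistics.mean(scores), statistics.stdev(scores)


def clearly_better(baseline: Sequence[float], candidate: Sequence[float],
                   k: float = 2.0) -> bool:
    """True if the candidate mean beats the baseline mean by more than k times
    the larger of the two standard deviations."""
    base_mean, base_sd = summarize(baseline)
    cand_mean, cand_sd = summarize(candidate)
    noise = max(base_sd, cand_sd)
    return cand_mean - base_mean > k * noise


# Illustrative scores from three repeated runs per configuration.
baseline_scores = [0.88, 0.86, 0.90]
within_noise = [0.89, 0.87, 0.91]
real_gain = [0.95, 0.96, 0.94]
```

Here the `within_noise` configuration has a slightly higher mean than the baseline, but the gap is smaller than the spread between repeats, so it should not be read as an improvement.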

Examples

Pre-deploy gate run

An eval run scores a new prompt version against the intent-accuracy benchmark: 91.4%, above the 90% bar, so the change is cleared to ship.

Regression catch

A scheduled eval run after a model upgrade shows faithfulness dropping from 0.88 to 0.81, flagging a regression before any users are affected.