What is an AI Experiment?

A structured trial that changes one variable in an AI system and measures its effect on quality, cost, or latency.

What is the purpose of an AI Experiment?

An AI experiment isolates a single change, such as a model, prompt, or parameter, and scores it against a fixed dataset and agreed metrics. It lets a team answer "is this version actually better?" with evidence rather than a hunch from a few lucky demos.

How do you use an AI Experiment in product management?

Change one thing at a time and hold the dataset and eval constant, or you won't know what moved the metric. State the hypothesis and the success criterion before you run. Log cost and latency alongside quality. A model that is two points better but three times slower is rarely the right call.

Where does the concept of an AI Experiment come from?

The discipline is offline evaluation, imported from machine learning and adapted to the open-ended outputs of language models, where generated prose resists exact-match grading. The field's answer was LLM-as-a-judge: a strong model scores another model's output against a reference or rubric. Evidently AI and LangSmith both describe this loop. Mature practice keeps a held-out slice of labelled examples to validate the judge itself.

What are common mistakes with an AI Experiment?

Eyeballing a handful of outputs instead of running a fixed evaluation set turns a vibe check into a decision and rewards whichever variant happened to demo well. Changing the prompt, model, and retrieval at once leaves you unable to attribute any improvement. Optimising a single automated score while ignoring qualitative failure modes ships a model that benchmarks well and behaves badly on the cases users actually hit.

AI Experiment

What is an example of an AI Experiment?

Cheaper model, same quality? Swap the summarisation model from a large to a mid-tier one, holding the prompt and the 200-document golden set fixed. Result: 1.5-point drop in quality score, 4x lower cost, accepted.

A controlled trial varying a model, prompt, or parameter and measuring the effect.

AI & Machine Learning · Engineering & Platform · type: 'ai_experiment' · interface: BaseNode

Description

An AI experiment is a structured comparison: one or more models, prompts, or configurations run against a fixed dataset and scored on agreed metrics. It is how a team answers "is this version actually better?" with evidence instead of a hunch from three lucky demos.

Origin & evolution

The discipline is offline evaluation, imported from machine learning and adapted to the open-ended outputs of language models. Where a classifier could be scored against labelled ground truth, generated prose resists exact-match grading. The field's answer was LLM-as-a-judge: use a strong model to score another model's output against a reference or a rubric. Evidently AI and LangSmith both describe the now-standard loop, where an experiment runs a candidate system over a curated dataset and a judge or metric produces comparable scores. Mature practice keeps a held-out slice of manually labelled examples to validate the judge itself, since an unchecked judge inherits its own biases.
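
Validating the judge against the held-out slice can be sketched as below. This is a minimal illustration, assuming both the human labellers and the judge emit a pass/fail verdict per example. The function name `judge_agreement` and the sample labels are hypothetical. Raw agreement is reported next to Cohen's kappa, which corrects for agreement expected by chance.

```python
from collections import Counter

def judge_agreement(human: list[str], judge: list[str]) -> dict[str, float]:
    """Compare judge verdicts with human labels on a held-out slice."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected if each rater labelled at random with their own base rates.
    human_freq, judge_freq = Counter(human), Counter(judge)
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}

# Hypothetical held-out slice: 10 examples labelled by a person and by the judge.
human_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "pass", "fail"]

result = judge_agreement(human_labels, judge_labels)
print(f"agreement={result['agreement']:.2f} kappa={result['kappa']:.2f}")
```

What level of agreement is good enough is a team decision. The sketch only makes the number visible so the judge is not trusted unchecked.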

How it works in practice

A team wants to swap a prompt that is costing too many tokens. They assemble a dataset of 200 real support questions with hand-written ideal answers. They run two configurations: the live prompt and a leaner candidate. An LLM judge scores each answer for accuracy and helpfulness; a script measures token use. The candidate scores 4.4 against the live prompt's 4.5 on quality, a difference inside the noise, while cutting tokens by 38%. The experiment makes the trade-off legible, and the team ships the cheaper prompt with confidence rather than crossing their fingers in production.
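
One way to check whether a gap like 4.4 versus 4.5 sits "inside the noise" is to look at per-example score differences rather than the two averages alone. The sketch below is an assumption-laden illustration, not the team's actual method: it uses a normal-approximation interval on paired differences, and the scores, token counts, and function names are hypothetical. With a sample as small as this one, a t-based interval or a bootstrap would be more defensible.

```python
from math import sqrt
from statistics import mean, stdev

def paired_difference(baseline: list[float], candidate: list[float], z: float = 1.96):
    """Mean per-example difference (candidate - baseline) with an approximate 95% interval."""
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need paired scores for at least two examples")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    centre = mean(diffs)
    half_width = z * stdev(diffs) / sqrt(len(diffs))
    return centre, (centre - half_width, centre + half_width)

def token_reduction(baseline_tokens: list[int], candidate_tokens: list[int]) -> float:
    """Fraction of tokens saved by the candidate over the whole dataset."""
    return 1 - sum(candidate_tokens) / sum(baseline_tokens)

# Hypothetical judge scores (1-5) for the same 10 questions under each prompt.
live_scores = [5, 4, 5, 4, 5, 4, 5, 4, 5, 4]        # mean 4.5
lean_scores = [5, 4, 4, 5, 5, 4, 4, 4, 5, 4]        # mean 4.4

mean_diff, (low, high) = paired_difference(live_scores, lean_scores)
within_noise = low <= 0 <= high                      # interval straddles zero
saved = token_reduction([1000] * 10, [620] * 10)     # hypothetical token counts

print(f"diff={mean_diff:+.2f} interval=({low:.2f}, {high:.2f}) within_noise={within_noise}")
print(f"tokens saved: {saved:.0%}")
```

Pairing matters because both prompts answer the same questions. Differencing per example removes the variation that comes from some questions simply being harder than others.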

Anthropic's guidance on agentic systems rests on the same logic: start with simple prompts, optimize them with comprehensive evaluation, and add multi-step orchestration only when simpler solutions demonstrably fall short. By that position, the experiment is not a later-stage quality check but the gate that justifies every increase in system complexity.

AI Experiment vs. its neighbours

AI Trace records what one production call actually did. An experiment runs many controlled calls against a known dataset before anything reaches users. The trace is observation in the wild; the experiment is a lab test.
Eval Benchmark is the standardised yardstick (a public test set and metric) that an experiment can run against. The benchmark is the ruler; the experiment is the act of measuring your system with it.
AI Dataset is the fixed input an experiment runs over. The dataset is the question set; the experiment is the graded sitting.

In the graph

In the Unified Product Graph, an AI Experiment lives in the AI region as the evaluation primitive. The model under test connects through ai_model_evaluated_through_ai_experiment, and any models the experiment invokes (a judge, a baseline) connect through ai_experiment_uses_ai_model. Keeping the experiment as its own node means a model's quality claims point at the evidence that backs them, so "we improved accuracy" is always traceable to the run that demonstrated it.
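
The traceability described above can be illustrated with a tiny in-memory edge list. The two edge type names come from this page. The node identifiers, the edge direction, and the helper functions are assumptions made for the sketch and do not reflect any actual graph API.

```python
from typing import NamedTuple

class Edge(NamedTuple):
    source: str
    edge_type: str
    target: str

# Edge type names are from the page; everything else here is hypothetical.
EVALUATED_THROUGH = "ai_model_evaluated_through_ai_experiment"
USES_MODEL = "ai_experiment_uses_ai_model"

edges = [
    Edge("model:candidate", EVALUATED_THROUGH, "experiment:run-1"),
    Edge("experiment:run-1", USES_MODEL, "model:judge"),
    Edge("experiment:run-1", USES_MODEL, "model:baseline"),
]

def evidence_for(model_id: str, graph: list[Edge]) -> list[str]:
    """Experiments that back a model's quality claims."""
    return [e.target for e in graph
            if e.source == model_id and e.edge_type == EVALUATED_THROUGH]

def models_invoked_by(experiment_id: str, graph: list[Edge]) -> list[str]:
    """Models an experiment calls on, such as a judge or a baseline."""
    return [e.target for e in graph
            if e.source == experiment_id and e.edge_type == USES_MODEL]

runs = evidence_for("model:candidate", edges)
helpers = models_invoked_by("experiment:run-1", edges)
print(runs, helpers)
```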

Preview

AI Experiment

Safe Change explainability prompt experiment: structured vs freeform output

Project: Safe Change trust uplift
Run name: structured-vs-freeform-v3
Config: {"prompt_version":"1.4.0","output_format":"structured","temperature":0,"max_tokens":512}
Summary metrics: approval rate 91% structured vs 83% freeform; director edit rate 5% vs 14%
Started at: 2026-05-20T09:00:00Z
Completed at: 2026-05-20T11:42:00Z
Training steps: 0
Artifact URI: s3://trellis-ml/experiments/safe-change-structured-v3/results.json
Notes: Structured JSON output with labelled risk level reduced director editing substantially
Tags: safe-change, explainability, prompt-engineering

Properties

Type-specific fields on BaseNode

project (string): Parent project or experiment group
run_name (string): Human-readable run name
config (string): Serialised hyperparameters and config
summary_metrics (string): Key-metric summary
started_at (string): ISO timestamp started
completed_at (string): ISO timestamp completed
training_steps (number): Training steps or epochs completed
artifact_uri (string): Produced artifact URI
notes (string): Free-text notes
tags (string[]): Free-form classification tags

Inherited from BaseNode (6 fields)

id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
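
For readers who want to handle these records in code, the fields above can be mirrored in a small Python dataclass. This is a sketch under assumptions: the field names and the sample values come from this page, but the class, its defaults, and the two helper methods are hypothetical and are not the product's actual schema or API.

```python
import json
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class AIExperiment:
    # Fields inherited from BaseNode
    title: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    type: str = "ai_experiment"
    description: Optional[str] = None
    status: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    # Type-specific fields
    project: Optional[str] = None
    run_name: Optional[str] = None
    config: Optional[str] = None            # serialised JSON, per the page
    summary_metrics: Optional[str] = None
    started_at: Optional[str] = None        # ISO timestamps stored as strings
    completed_at: Optional[str] = None
    training_steps: Optional[int] = None
    artifact_uri: Optional[str] = None
    notes: Optional[str] = None

    def parsed_config(self) -> dict:
        """Decode the serialised config; an empty dict if none was recorded."""
        return json.loads(self.config) if self.config else {}

    def duration(self) -> Optional[timedelta]:
        """Wall-clock run time, or None if either timestamp is missing."""
        if not (self.started_at and self.completed_at):
            return None
        start, end = (datetime.fromisoformat(ts.replace("Z", "+00:00"))
                      for ts in (self.started_at, self.completed_at))
        return end - start

exp = AIExperiment(
    title="Safe Change explainability prompt experiment: structured vs freeform output",
    project="Safe Change trust uplift",
    run_name="structured-vs-freeform-v3",
    config='{"prompt_version":"1.4.0","output_format":"structured","temperature":0,"max_tokens":512}',
    started_at="2026-05-20T09:00:00Z",
    completed_at="2026-05-20T11:42:00Z",
    training_steps=0,
    tags=["safe-change", "explainability", "prompt-engineering"],
)
print(exp.parsed_config()["output_format"], exp.duration())
```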

Lifecycle

5 phases, initial: planned · template: STUDY

Relationships

3 edge types connected to this entity.

Parents

Entities that can contain this type

AI Model: ai_model_evaluated_through_ai_experiment

Cross-References

Contextual links across the graph

AI Model: ai_experiment_based_on_ai_model
AI Model: ai_experiment_uses_ai_model

Graph Position

1 parent · AI Experiment · 2 cross-refs

Definition

An AI experiment changes one variable in an AI system, such as a model, prompt, or parameter, and measures the effect on quality, cost, or latency. It runs against a dataset, is scored by an eval run, and produces the comparison that justifies what ships.

Usage Guidance

Change one thing at a time and hold the dataset and eval constant, or you won't know what moved the metric.
State the hypothesis and the success criterion before you run.
Log cost and latency alongside quality.
A model that is two points better but three times slower is rarely the right call.
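
The guidance above can be made concrete by writing the success criterion down as data before the run and applying it mechanically afterwards. The sketch below is illustrative only: the class names, thresholds, and measurements are hypothetical, and a real criterion would use whatever metrics the team has agreed on.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SuccessCriterion:
    hypothesis: str
    min_quality_gain: float    # points on the agreed quality metric
    max_latency_ratio: float   # candidate latency / baseline latency
    max_cost_ratio: float      # candidate cost / baseline cost

@dataclass(frozen=True)
class RunResult:
    quality: float
    latency_ms: float
    cost_usd: float

def decide(criterion: SuccessCriterion, baseline: RunResult,
           candidate: RunResult) -> tuple[bool, list[str]]:
    """Return (accepted, reasons for rejection) under the pre-stated criterion."""
    reasons = []
    gain = candidate.quality - baseline.quality
    if gain < criterion.min_quality_gain:
        reasons.append(f"quality gain {gain:+.1f} below {criterion.min_quality_gain}")
    latency_ratio = candidate.latency_ms / baseline.latency_ms
    if latency_ratio > criterion.max_latency_ratio:
        reasons.append(f"latency {latency_ratio:.1f}x exceeds {criterion.max_latency_ratio}x")
    cost_ratio = candidate.cost_usd / baseline.cost_usd
    if cost_ratio > criterion.max_cost_ratio:
        reasons.append(f"cost {cost_ratio:.1f}x exceeds {criterion.max_cost_ratio}x")
    return (not reasons, reasons)

# Written before the run, with hypothetical thresholds.
criterion = SuccessCriterion(
    hypothesis="The larger model improves answer quality without hurting responsiveness",
    min_quality_gain=1.0, max_latency_ratio=1.5, max_cost_ratio=1.2,
)
baseline = RunResult(quality=80.0, latency_ms=400.0, cost_usd=0.010)
candidate = RunResult(quality=82.0, latency_ms=1200.0, cost_usd=0.010)  # 2 points better, 3x slower

accepted, reasons = decide(criterion, baseline, candidate)
print(accepted, reasons)
```

Because the thresholds are fixed before the results arrive, a two-point quality gain cannot quietly excuse a threefold latency increase after the fact.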

Anti-Patterns

Eyeballing a handful of outputs instead of running a fixed evaluation set turns a vibe check into a decision and rewards whichever variant happened to demo well.
Changing the prompt, model, and retrieval at once leaves you unable to attribute any improvement.
Optimising a single automated score while ignoring qualitative failure modes ships a model that benchmarks well and behaves badly on the cases users actually hit.
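
The second anti-pattern is easy to guard against mechanically: compare the two configurations before running and refuse to proceed unless exactly one key differs. The helper names and config values below are hypothetical. The keys echo the sample config shown in the Preview, with a `model` key added for illustration.

```python
_MISSING = object()  # sentinel so an absent key counts as a difference

def changed_keys(baseline: dict, candidate: dict) -> list[str]:
    """Names of config keys whose values differ between the two runs."""
    keys = sorted(set(baseline) | set(candidate))
    return [k for k in keys
            if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)]

def require_single_change(baseline: dict, candidate: dict) -> str:
    """Return the one variable under test, or raise if the comparison is confounded."""
    changed = changed_keys(baseline, candidate)
    if len(changed) != 1:
        raise ValueError(
            f"expected exactly one changed variable, found {len(changed)}: {changed}"
        )
    return changed[0]

baseline = {"model": "model-a", "prompt_version": "1.3.0",
            "output_format": "freeform", "temperature": 0}
candidate = dict(baseline, output_format="structured")
confounded = dict(baseline, model="model-b", prompt_version="1.4.0")

variable_under_test = require_single_change(baseline, candidate)
print(variable_under_test)

try:
    require_single_change(baseline, confounded)
except ValueError as err:
    print(f"rejected: {err}")
```

A check like this covers only what is captured in the config. The dataset and the eval also have to stay fixed, which the config comparison cannot see unless their versions are recorded there too.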

Examples

Cheaper model, same quality?

Swap the summarisation model from a large to a mid-tier one, holding the prompt and the 200-document golden set fixed. Result: 1.5-point drop in quality score, 4x lower cost, accepted.

Few-shot vs zero-shot routing

Compare a zero-shot classifier prompt against a 5-example few-shot version on the same intent dataset. Few-shot lifts accuracy from 86% to 92% at a small token-cost increase.
