What is the purpose of an Experiment Run?

An experiment run records what actually happened when a plan was carried out: the real dates, the sample that materialised, anything that went wrong, and the raw numbers. Holding the run separately from the plan means a flawed execution can be flagged without discrediting the underlying design, and results are always read against the criteria fixed beforehand.

How do you use an Experiment Run in product management?

Record the real conditions, not the intended ones: actual sample size, actual duration, and any contamination or outage during the window. Note deviations from the plan honestly; a run that diverged is still informative if you know how. Link the run to its plan and let the pre-set criteria, not hindsight, decide the verdict.

Where does the concept of an Experiment Run come from?

Running product experiments to settle questions comes from the lean and growth traditions, with Eric Ries building The Lean Startup (2011) around the Build-Measure-Learn loop. The statistical machinery arrived with the peeking problem: Evan Miller showed that checking results repeatedly inflates false positives, and Ramesh Johari and colleagues formalised the fix with always-valid p-values in Peeking at A/B Tests (KDD, 2017). The field settled on a clean separation of plan, run and result, so a failed run still counts as evidence.

What are common mistakes with an Experiment Run?

Stopping a run the moment it crosses significance ("peeking") inflates false positives and is one of the most common ways online experiments mislead. Quietly editing a run's record to match the story you want to tell destroys the very trace that makes the result trustworthy. And reading a contaminated run (an outage, a concurrent launch, a botched split) as clean signal turns an honest miss into a confident wrong conclusion.

Experiment Run

What is an example of an Experiment Run?

Pricing A/B, March run: Ran 3–17 March, 4,210 visitors per arm. Annual-plan selection rose 6.1 points (p = 0.02). Note: a 40-minute checkout outage on day 4 affected the control arm only.

A single execution of an experiment plan, capturing the actual dates, sample, conditions, and raw results.

Validation · Discovery, Research & Validation · type: 'experiment_run' · interface: BaseNode

Description

An experiment run is a single execution of an experiment plan: one A/B test that actually shipped to traffic, one fake-door that ran for a fortnight, one cohort you followed for thirty days. The plan is the design; the run is the instance that produced data on specific dates with specific users.

Origin & evolution

The discipline of running product experiments to settle questions comes from the lean and growth traditions. Eric Ries built *The Lean Startup* (2011) around the Build-Measure-Learn loop and the idea of validated learning, where a value hypothesis and a growth hypothesis are tested against real customer behaviour, not asserted from the desk. A run is the "measure" leg of that loop made concrete: the loop turns once per run.

The statistical machinery is older and arrived with a warning attached. Online controlled experiments inherited the frequentist test, and with it the peeking problem. Evan Miller's widely-read How Not To Run an A/B Test showed that checking results repeatedly and stopping the moment significance appears inflates the false-positive rate far beyond the nominal five per cent. Ramesh Johari and colleagues formalised the fix in the 2017 KDD paper Peeking at A/B Tests, introducing always-valid p-values and sequential tests that stay honest under continuous monitoring. The lesson reshaped how a run is defined: a run has a pre-registered sample size or a sequential stopping rule, and analysing it is part of the run, not a separate liberty the analyst takes whenever the dashboard looks good.
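The inflation described above can be seen in a small simulation. The sketch below is illustrative only: it runs A/A tests (both arms share the same true rate, so every "significant" result is a false positive) and compares one analysis at the planned sample size with checking at ten interim looks. The trial count, sample size, conversion rate and number of looks are all assumed values, and it uses a plain two-proportion z-test, not the always-valid method from the paper.

```python
import random
from statistics import NormalDist


def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic for two arms of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def simulate(trials=1000, n_per_arm=1000, looks=10, rate=0.2, alpha=0.05, seed=7):
    """Return (fixed_rate, peeking_rate) of false positives over A/A tests."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    step = n_per_arm // looks
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        crossed = False
        for i in range(1, n_per_arm + 1):
            # Both arms convert at the same true rate: there is no real effect.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # A peeker would stop and declare a win at the first crossing.
            if i % step == 0 and abs(z_stat(conv_a, conv_b, i)) > z_crit:
                crossed = True
        peeking_hits += crossed
        # The disciplined analyst looks once, at the planned sample size.
        fixed_hits += abs(z_stat(conv_a, conv_b, n_per_arm)) > z_crit
    return fixed_hits / trials, peeking_hits / trials


if __name__ == "__main__":
    fixed, peeking = simulate()
    print(f"single look: {fixed:.3f}  ten looks: {peeking:.3f}")
```

Under these assumptions the single-look rate stays near the nominal level, while the ten-look rate comes out well above it.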

Where the field landed is a clean separation of three things. The plan states the hypothesis, the metric, the minimum detectable effect, and the stopping rule. The run executes that plan once. The result is what the run yields, kept distinct so that a failed run still counts as evidence and a single plan can be run more than once without anyone pretending the reruns were one test.

How it works in practice

A team believes a shorter signup form will lift completion. The experiment plan fixes the success metric at signup-completion rate, sets a minimum detectable effect of two percentage points, and commits to a sample size of 40,000 visitors per arm, enough to reach at least 80 per cent power. That is the design, and it does not move.

The run starts on the first of the month and splits live traffic 50/50. Eleven days in, the variant is up 3.1 points and someone wants to call it. The pre-committed rule says wait for 40,000 per arm; with a sequential design, the always-valid boundary has not been crossed. They hold. By day eighteen the lift settles at 1.4 points, below the two-point minimum effect that justified the work. The run produced a clear result, just not the hoped-for one, and because the stopping rule was set in the plan, nobody can argue the team peeked their way to a phantom win.
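The fixed-sample version of that rule can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `Plan` and `Snapshot` types, the `read` function and its return labels are hypothetical, and the day-11 visitor count is invented for the example (the text only says the target had not been reached). It does not implement the sequential boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    metric: str
    min_effect_points: float  # minimum detectable effect, in percentage points
    sample_per_arm: int       # pre-committed sample size per arm


@dataclass(frozen=True)
class Snapshot:
    day: int
    visitors_per_arm: int
    lift_points: float        # observed lift, in percentage points


def read(plan: Plan, snap: Snapshot) -> str:
    """Apply the plan's pre-committed rule to what the run has observed so far."""
    if snap.visitors_per_arm < plan.sample_per_arm:
        # The planned sample is not in yet, so the observed lift is not read.
        return "keep running"
    if snap.lift_points >= plan.min_effect_points:
        return "meets threshold"
    return "below threshold"


plan = Plan("signup-completion rate", min_effect_points=2.0, sample_per_arm=40_000)
day_11 = Snapshot(day=11, visitors_per_arm=24_500, lift_points=3.1)  # count is hypothetical
day_18 = Snapshot(day=18, visitors_per_arm=40_000, lift_points=1.4)

if __name__ == "__main__":
    print(read(plan, day_11))
    print(read(plan, day_18))
```

The frozen dataclasses mirror the point that the design does not move once the run has started: the rule is fixed in the plan and the run only supplies observations.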

Experiment run vs. its neighbours

Experiment plan is the reusable design: hypothesis, metric, stopping rule. The run is one execution of it. The same plan can produce several runs across quarters or markets, and conflating plan with run is how reruns get laundered into a single inflated result.
Hypothesis is the testable belief the run is built to challenge. The run does not contain the belief; it subjects it to evidence. One hypothesis may need several runs before it is confidently validated or killed.
Learning is the durable conclusion a run produces, the thing the team carries forward. A run yields evidence; the learning is the interpretation that survives. A run with no recorded learning has spent traffic without recording a conclusion.

In the graph

In the Unified Product Graph, experiment_run sits in the validation region as the instance node between design and conclusion. experiment_plan_ran_as_experiment_run records that a run is one execution of a plan, which makes reruns first-class and countable. experiment_run_validates_hypothesis ties the run to the belief under test, experiment_run_yields_evidence captures the raw measured outcome, and experiment_run_produces_learning records what the team concluded. Separating evidence from learning matters: it preserves the audit trail when a later run contradicts an earlier one, and it makes the peeking problem structurally visible, because a hypothesis "validated" by a single run with no pre-registered plan is queryably weak.
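To illustrate what "countable" and "queryably weak" could look like, here is a small sketch over an in-memory edge list. The edge type names come from the text above; the tuple layout, node ids and both helper functions are assumptions made for the example and do not reflect any actual graph API.

```python
from collections import defaultdict

PLAN_RAN_AS_RUN = "experiment_plan_ran_as_experiment_run"
RUN_VALIDATES_HYPOTHESIS = "experiment_run_validates_hypothesis"

# (source_id, edge_type, target_id); all node ids are hypothetical.
EDGES = [
    ("plan-1", PLAN_RAN_AS_RUN, "run-march"),
    ("plan-1", PLAN_RAN_AS_RUN, "run-april"),
    ("run-march", RUN_VALIDATES_HYPOTHESIS, "hyp-annual"),
    ("run-april", RUN_VALIDATES_HYPOTHESIS, "hyp-annual"),
    ("run-adhoc", RUN_VALIDATES_HYPOTHESIS, "hyp-banner"),  # run with no plan edge
]


def runs_per_plan(edges):
    """Count how many runs each plan has been executed as."""
    counts = defaultdict(int)
    for src, kind, _dst in edges:
        if kind == PLAN_RAN_AS_RUN:
            counts[src] += 1
    return dict(counts)


def weakly_supported(edges):
    """Hypotheses backed by exactly one run, where that run has no plan."""
    planned_runs = {dst for _src, kind, dst in edges if kind == PLAN_RAN_AS_RUN}
    runs_by_hypothesis = defaultdict(list)
    for src, kind, dst in edges:
        if kind == RUN_VALIDATES_HYPOTHESIS:
            runs_by_hypothesis[dst].append(src)
    return sorted(
        hyp
        for hyp, runs in runs_by_hypothesis.items()
        if len(runs) == 1 and runs[0] not in planned_runs
    )


if __name__ == "__main__":
    print(runs_per_plan(EDGES))
    print(weakly_supported(EDGES))
```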

Worked example: Trellis

One run of the 10 percent rollout recorded the actual share of agent-proposed changes that directors approved without reverting, and the retention gap between treated and control workspaces at week 4. A single experiment run is the raw evidential record: the approved-versus-reverted rates and retention numbers that either support or refute the hypothesis that previewed, reversible change drives director trust.

Experiment Run: 10 percent Safe Change rollout, week 4

Severity of finding: Mild inconvenience
Disposition: confirmed
Actual start date: 2025-10-07
Actual end date: 2025-11-04
Actual reach: 312
Outcome summary: Approved-to-reverted ratio was 11:1 in the treatment group; week-4 retention was 6 points above control.
Learning: Directors in the treatment group granted the agent substantially more autonomy once they saw changes could be explained and undone. Hesitation converted into trust across all company-size bands.

Properties

Type-specific fields on BaseNode

actual_start_date (string): ISO actual start date (may differ from the plan's `planned_start_date`)
actual_end_date (string): ISO actual end date
actual_reach (number): Observed reach: how many people the run actually touched
outcome_summary (string): Plain-English outcome
severity_of_finding (assessment): Severity / strength of the finding (UPGAssessment)

Severity (5-point) scale:

Mild inconvenience: Notices but works around easily
Annoying: Frustrated but can continue
Significant: Has to change approach
Severe: Struggles to accomplish goal
Blocker: Cannot accomplish goal

learning (string): What the team learned (rich text)
disposition (enum): Resolution against the parent plan's success criteria.

`confirmed` = evidence supports the parent hypothesis_claim.
`disconfirmed` = evidence refutes the parent hypothesis_claim.
`inconclusive` = data insufficient or noisy.
`aborted` = run terminated early.

Inherited from BaseNode (6 fields)

id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
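As a rough illustration of how these fields fit together, the sketch below models a run as a Python dataclass and fills it with the Trellis record from the worked example. It is one possible shape under stated assumptions, not the graph's actual schema: treating every field except `title` as optional, representing `severity_of_finding` as a plain label (the structure of UPGAssessment is not described here), generating `id` with `uuid4`, and the `duration_days` helper are all choices made for the example.

```python
import uuid
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class Disposition(str, Enum):
    CONFIRMED = "confirmed"
    DISCONFIRMED = "disconfirmed"
    INCONCLUSIVE = "inconclusive"
    ABORTED = "aborted"


@dataclass
class ExperimentRun:
    title: str                                   # required on BaseNode
    actual_start_date: Optional[str] = None      # ISO date string
    actual_end_date: Optional[str] = None        # ISO date string
    actual_reach: Optional[int] = None
    outcome_summary: Optional[str] = None
    severity_of_finding: Optional[str] = None    # assumption: label only
    learning: Optional[str] = None
    disposition: Optional[Disposition] = None
    description: Optional[str] = None
    status: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    type: str = "experiment_run"

    def duration_days(self) -> Optional[int]:
        """Days between actual start and end, or None if either is missing."""
        if not (self.actual_start_date and self.actual_end_date):
            return None
        start = date.fromisoformat(self.actual_start_date)
        end = date.fromisoformat(self.actual_end_date)
        if end < start:
            raise ValueError("actual_end_date is before actual_start_date")
        return (end - start).days


trellis = ExperimentRun(
    title="10 percent Safe Change rollout, week 4",
    actual_start_date="2025-10-07",
    actual_end_date="2025-11-04",
    actual_reach=312,
    outcome_summary=(
        "Approved-to-reverted ratio was 11:1 in the treatment group; "
        "week-4 retention was 6 points above control."
    ),
    severity_of_finding="Mild inconvenience",
    disposition=Disposition.CONFIRMED,
)

if __name__ == "__main__":
    print(trellis.type, trellis.duration_days())
```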

Lifecycle

5 phases, initial: planned · template: STUDY

Relationships

19 edge types connected to this entity.

Parents

Entities that can contain this type

Experiment Plan: experiment_plan_ran_as_experiment_run
Experiment: experiment_executed_as_experiment_run
Experiment Run: experiment_run_tested_via_experiment_run
Dashboard: dashboard_contains_experiment_run

Children

Entities this type can contain

Learning: experiment_run_produces_learning
Evidence: experiment_run_yields_evidence
Metric: experiment_run_measures_metric
Experiment Run: experiment_run_tested_via_experiment_run

Cross-References

Contextual links across the graph

Hypothesis: experiment_run_validates_hypothesis
Insight: experiment_run_produced_insight_insight
Decision: experiment_run_informed_decision_decision
Beta Program: beta_program_runs_experiment_run
Variant: experiment_run_tests_variant
Cohort: cohort_exposed_to_experiment_run
Pricing Tier: experiment_run_tests_pricing_tier
Feature: experiment_run_tests_feature
Metric: experiment_run_guards_metric
Metric: experiment_run_measured_by_metric
Outcome: experiment_run_measures_outcome

Graph Position

4 parents · Experiment Run · 4 children · 11 cross-references

Used in Frameworks

2 frameworks use this entity type.

Opportunity Solution Tree (discovery)
Hypothesis Board (validation)

Definition

An experiment run is a single execution of an experiment plan, capturing the actual dates, sample, conditions, and raw results of one instance. It links back to its plan so results are read against pre-fixed criteria and a flawed run can be flagged without discrediting the design.

Usage Guidance

Record the real conditions, not the intended ones: actual sample size, actual duration, and any contamination or outage during the window.
Note deviations from the plan honestly; a run that diverged is still informative if you know how.
Link the run to its plan and let the pre-set criteria, not hindsight, decide the verdict.
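One way to make deviations hard to overlook is to derive them by comparing the run against its plan. The sketch below is illustrative only: `planned_start_date` and the `actual_*` names appear in the property list, but `planned_end_date`, `planned_reach`, the dict layout and the `deviations` helper are hypothetical, as are the dates and the planned reach in the example. The outage note is taken from the pricing example.

```python
# (plan field, run field); planned_end_date and planned_reach are hypothetical names.
FIELD_PAIRS = [
    ("planned_start_date", "actual_start_date"),
    ("planned_end_date", "actual_end_date"),
    ("planned_reach", "actual_reach"),
]


def deviations(plan, run, incidents=()):
    """List where the run departed from its plan, plus any recorded incidents."""
    notes = []
    for planned_key, actual_key in FIELD_PAIRS:
        if planned_key in plan and actual_key in run:
            if plan[planned_key] != run[actual_key]:
                notes.append(
                    f"{actual_key} {run[actual_key]} differs from "
                    f"{planned_key} {plan[planned_key]}"
                )
    notes.extend(f"incident: {text}" for text in incidents)
    return notes


# Hypothetical dates and planned reach, for illustration only.
plan = {
    "planned_start_date": "2025-03-03",
    "planned_end_date": "2025-03-17",
    "planned_reach": 8000,
}
run = {
    "actual_start_date": "2025-03-03",
    "actual_end_date": "2025-03-17",
    "actual_reach": 8420,
}
notes = deviations(
    plan, run, incidents=["40-minute checkout outage on day 4, control arm only"]
)

if __name__ == "__main__":
    for note in notes:
        print(note)
```

Keeping the incident alongside the computed differences means the verdict can still be read against the plan's criteria, with the contamination visible next to it.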

Anti-Patterns

Stopping a run the moment it crosses significance ("peeking") inflates false positives and is one of the most common ways online experiments mislead.
Quietly editing a run's record to match the story you want to tell destroys the very trace that makes the result trustworthy.
Reading a contaminated run (an outage, a concurrent launch, a botched split) as clean signal turns an honest miss into a confident wrong conclusion.

Examples

Pricing A/B, March run

Ran 3–17 March, 4,210 visitors per arm. Annual-plan selection rose 6.1 points (p = 0.02). Note: a 40-minute checkout outage on day 4 affected the control arm only.

Re-run after the fix

The same plan was re-executed in April with the outage resolved; the lift held at 5.4 points, indicating that the March result was not an artefact of the incident.
