What is the purpose of a Test Result?

A test result is the signal teams gate releases on, where green means safe and red means broken. Its value rests on that meaning holding: a result that flips without the code changing is called flaky and undermines the trust the signal depends on.

How do you use a Test Result in product management?

Record one result per execution, not per test definition, so history is preserved across runs. Always attach the build or commit, the environment, and the failure evidence (stack trace, diff, screenshot) so a red result is debuggable without re-running. Link failing results to the investigation, fix, or root cause they triggered.

Where does the concept of a Test Result come from?

The pass/fail verdict is foundational to automated testing, with skip arriving as an honest third state for tests deliberately not run. The fourth state, flaky, reshaped how teams read results: Google quantified the problem in Flaky Tests at Google and How We Mitigate Them (2016), finding about 1.5 per cent of all test runs flaky and roughly 84 per cent of pass-to-fail transitions caused by flakiness rather than a real regression. That finding pushed the field from reading single results toward reading trends, since a single run is a data point and the honest reading of a suite is the pattern across runs.

What are common mistakes with a Test Result?

The most common misuse is treating a result as the test itself, overwriting the previous outcome on each run so no execution history survives, which makes flakiness and regression invisible. Recording only pass/fail without the build, environment, and failure evidence produces a red signal nobody can debug, so the result gets ignored or re-run blindly. Teams also let a single result stand in for many environments, hiding the cases where a test passes locally but fails in CI. Finally, results that are never linked back to a fix or root cause become noise: a wall of red that the team learns to scroll past rather than act on.

Test Result

Q: What is a Test Result?

The verdict a test case or suite emits when it runs: pass, fail, skip, or a flaky state.

Q: What is an example of a Test Result?

Failed assertion on a checkout regression: Test case 'applies 10% loyalty discount at checkout' fails on build #4821 in the staging environment. The result records the assertion (expected £45.00, received £50.00), a stack trace, and a link to the fix that corrected the rounding logic.

The pass/fail outcome of one test run, with its evidence.

Quality Assurance · Operations & Quality · type: 'test_result' · interface: BaseNode

Description

A test result is the verdict a test case or suite emits when it runs: pass, fail, skip, or a flaky state. It is the signal a team gates releases on, and its value depends on a green meaning safe and a red meaning broken. A result that flips without the code changing is called flaky.

Origin & evolution

The pass/fail verdict is foundational to automated testing, and for most of its history the model was binary: a test asserted something, and the assertion either held or it did not. Skip arrived as an honest third state for tests deliberately not run, on an unsupported platform, behind a disabled feature flag, or pending a fix.

Kent Beck's formulation of test-driven development made the pass/fail verdict a tight feedback loop: write a failing test, write the minimum code to make it pass, then refactor. In that cycle the result is not merely a report but the primary signal driving the next action, which is why the binary meaning of red and green matters so much — ambiguity in the verdict breaks the loop.

The fourth state, flaky, is the one that reshaped how teams read results. A flaky test produces both passing and failing outcomes against the same code, so its result carries no information about whether the product works. Google quantified the scale of the problem in its widely cited post, Flaky Tests at Google and How We Mitigate Them (2016): about 1.5% of all test runs were flaky, and roughly 84% of observed pass-to-fail transitions were caused by flakiness rather than a real regression. At that scale, a single run's result becomes nearly meaningless as a gate, because most red is noise.

That finding pushed the field from reading single results toward reading trends. Google built tooling to automatically quarantine tests above a flakiness threshold, removing them from the critical path and filing bugs against them. The lesson generalised: a result is a data point, and the honest reading of a suite is the pattern across runs, not the verdict of the latest one.

Humble and Farley's deployment pipeline model formalised results as stage gates: each pipeline stage runs a class of tests and its aggregate result must be green before work advances to the next stage. On that model a result is not just a verdict on a single run but a condition that controls whether software can move toward production — which is why a persistently unreliable result damages more than a single build; it erodes the pipeline's authority as a release gate.
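The stage-gate reading can be sketched as follows. This is a minimal illustration, not a pipeline implementation: the `Stage` shape, the function names, and the rule that skipped tests do not turn a stage red are assumptions made for the example. Only the status values come from the schema on this page.

```typescript
// Status values as listed under Properties on this page.
type ResultStatus = "passed" | "failed" | "timed_out" | "skipped" | "interrupted";

// Hypothetical shape: one status per test case executed in the stage.
interface Stage {
  name: string;
  results: ResultStatus[];
}

// Assumption: a stage is green when every test that actually ran passed.
// Skipped tests are ignored, but a stage where nothing ran is not green.
function stageIsGreen(stage: Stage): boolean {
  const ran = stage.results.filter((s) => s !== "skipped");
  return ran.length > 0 && ran.every((s) => s === "passed");
}

// Walk the stages in order and stop at the first red one.
// Returns the name of the blocking stage, or null if every gate opened.
function firstBlockingStage(stages: Stage[]): string | null {
  for (const stage of stages) {
    if (!stageIsGreen(stage)) return stage.name;
  }
  return null;
}

// Illustrative stage names and results.
const pipeline: Stage[] = [
  { name: "commit", results: ["passed", "passed", "skipped"] },
  { name: "acceptance", results: ["passed", "timed_out"] },
  { name: "capacity", results: ["passed"] },
];

console.log(firstBlockingStage(pipeline)); // "acceptance"
```

Because the gate is evaluated per stage, a single unreliable result in an early stage blocks everything behind it, which is the erosion of authority described above.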

How it works in practice

A team's CI shows a failing build. The failed test is an end-to-end checkout test asserting that an order confirmation appears within three seconds. An engineer re-runs it; it passes. Re-runs again; fails. Same commit each time. The result is flaky, and the cause is a race between the test and an asynchronous email-send the page waits on.

Treating the single red as a real failure would have blocked a clean release for nothing. Reading the trend instead, the team sees this test has flipped on 9 of the last 40 runs with no related code changes. They quarantine it out of the merge gate, file a bug, and fix the race by waiting on an explicit confirmation event rather than a timeout. The result becomes deterministic again, and only then does it return to gating releases. The verdict regained its authority by becoming repeatable.
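The quarantine decision in this scenario comes down to a ratio and a threshold. The sketch below assumes a hypothetical threshold of 5%; the page does not say what value any team uses, and `FlipStats`, `flipRate`, and `shouldQuarantine` are names invented for the illustration.

```typescript
// Hypothetical summary of one test's recent history.
interface FlipStats {
  flippedRuns: number; // runs whose verdict flipped with no related code change
  totalRuns: number;
}

function flipRate(stats: FlipStats): number {
  if (stats.totalRuns === 0) return 0; // no history, no evidence of flakiness
  return stats.flippedRuns / stats.totalRuns;
}

// Assumed threshold for illustration only.
const QUARANTINE_THRESHOLD = 0.05;

function shouldQuarantine(stats: FlipStats, threshold: number = QUARANTINE_THRESHOLD): boolean {
  return flipRate(stats) > threshold;
}

// The checkout test from the scenario: 9 flips in the last 40 runs.
const checkout: FlipStats = { flippedRuns: 9, totalRuns: 40 };

console.log(flipRate(checkout));         // 0.225
console.log(shouldQuarantine(checkout)); // true
```

Once the race is fixed and the flip count stays at zero over a fresh window of runs, the same check reports false and the test can return to the merge gate.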

Test result vs. its neighbours

Test case is what runs; the result is what running it produces. The case is the durable assertion you maintain; the result is the per-run, disposable verdict it emits. One case yields a new result on every execution, through test_case_produces_test_result.
Test suite produces an aggregate result; the case produces a granular one. The suite-level result is the roll-up that gates the build through test_suite_produces_test_result; a single case result tells you which assertion moved.
Bug is a confirmed defect; a failing result is a candidate for one. A red result might be a real regression, a flaky test, or a wrong assertion. The triage step is deciding which, and only the first becomes a bug.
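The triage rule in the last comparison can be written down as a small function. This is a sketch under assumptions: the `RedCause` labels and the `from`/`to` fields are invented for the example, and only the edge name `test_result_reports_bug` comes from this page.

```typescript
// The three causes a red result can have, as named above. Labels are illustrative.
type RedCause = "regression" | "flaky_test" | "wrong_assertion";

// Hypothetical edge shape; only the edge type name is taken from this page.
interface BugEdge {
  type: "test_result_reports_bug";
  from: string; // test result id
  to: string;   // bug id
}

// Only a real regression becomes a bug. A flaky test or a wrong assertion
// is a problem with the test and is fixed there instead.
function triage(resultId: string, cause: RedCause, bugId: string): BugEdge | null {
  if (cause !== "regression") return null;
  return { type: "test_result_reports_bug", from: resultId, to: bugId };
}

console.log(triage("result-1", "flaky_test", "bug-1")); // null
console.log(triage("result-1", "regression", "bug-1") !== null); // true
```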

Michael Feathers introduces a third reading of a result that is relevant here: the characterization test. Where a normal test asserts what code should do, a characterization test records what code currently does, and its result documents existing behavior rather than desired behavior. A red result from a characterization test does not mean the code is wrong; it means behavior has changed. That distinction — whether a result is checking a contract or pinning an observation — shapes how a team should act on a failure.
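A characterization test can be sketched in a few lines. Everything here is hypothetical: `legacyPriceLabel` stands in for any existing function whose behaviour is being pinned, and `characterize` and `changedInputs` are helper names made up for the example.

```typescript
// Hypothetical legacy function whose current behaviour we want to pin.
function legacyPriceLabel(pence: number): string {
  return "£" + (pence / 100).toFixed(2);
}

// Record what the code does today for a set of inputs.
// No claim is made that these outputs are what the code should do.
function characterize(fn: (n: number) => string, inputs: number[]): Record<string, string> {
  const observed: Record<string, string> = {};
  for (const input of inputs) {
    observed[String(input)] = fn(input);
  }
  return observed;
}

// Re-run the function and list the inputs whose output no longer matches the pin.
function changedInputs(fn: (n: number) => string, pinned: Record<string, string>): string[] {
  return Object.keys(pinned).filter((key) => fn(Number(key)) !== pinned[key]);
}

const pinned = characterize(legacyPriceLabel, [0, 4500, 5000]);

// A later change to the formatting: behaviour differs, which may or may not be a defect.
const refactored = (pence: number): string => "GBP " + (pence / 100).toFixed(2);

console.log(changedInputs(legacyPriceLabel, pinned));  // []
console.log(changedInputs(refactored, pinned).length); // 3
```

A non-empty list here is a red result in the characterization sense: it reports that behaviour moved, and the team still has to decide whether the move was intended.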

In the graph

In the Unified Product Graph, test_result sits in the quality and testing region as the emitted signal of verification. It is produced both at the granular level through test_case_produces_test_result and at the aggregate level through test_suite_produces_test_result. Modelling the result as its own node, separate from the case that emits it, is what lets the graph hold history: a sequence of results over time exposes flakiness, surfaces trends, and turns "the build is red" into the more useful question of whether this red has been red before.
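The question of whether this red has been red before only has an answer when results are kept as a sequence. The sketch below assumes a simplified `HistoryEntry` shape and treats any status other than passed or skipped as red; both are choices made for the illustration, not part of the graph's schema.

```typescript
type ResultStatus = "passed" | "failed" | "timed_out" | "skipped" | "interrupted";

// Hypothetical, trimmed-down view of one test case's results, oldest first.
interface HistoryEntry {
  versionTested: string;
  status: ResultStatus;
}

// Walk backwards from the latest result and collect the unbroken run of reds.
// Assumption: skipped carries no verdict, so it neither extends nor ends the streak.
function currentRedStreak(entries: HistoryEntry[]): HistoryEntry[] {
  const streak: HistoryEntry[] = [];
  for (const entry of entries.slice().reverse()) {
    if (entry.status === "skipped") continue;
    if (entry.status === "passed") break;
    streak.unshift(entry); // keep the streak in oldest-first order
  }
  return streak;
}

const checkoutHistory: HistoryEntry[] = [
  { versionTested: "v1", status: "passed" },
  { versionTested: "v2", status: "failed" },
  { versionTested: "v3", status: "skipped" },
  { versionTested: "v4", status: "failed" },
];

const redStreak = currentRedStreak(checkoutHistory);
console.log(redStreak.length > 1);                    // true: this red has been red before
console.log(redStreak.map((e) => e.versionTested));   // [ 'v2', 'v4' ]
```

If each run overwrote the previous one, only the `v4` entry would exist and the streak could not be computed at all.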

Preview

Test Result: Safe Change preview render: passed

Result status: passed

Duration ms: 1420

Retry index: 0

Version tested: v2.4.0-rc.1

Executed at: 2026-06-10T09:14:22Z

Attachments: safe-change-preview-screenshot.png, trace.json

Comment: All assertions on record-level diff rendering passed with no regressions.

Properties

Type-specific fields on BaseNode

result_status (enum)

Outcome of this execution. passed = all assertions met; failed = one or more assertions failed; timed_out = execution exceeded the timeout; skipped = test was not run; interrupted = test was stopped mid-run.

Values: passed, failed, timed_out, skipped, interrupted

duration_ms (number)

Duration of this execution in milliseconds

retry_index (number)

Retry index. 0 = first attempt, 1 = first retry, etc.

error_message (string)

Error message if the test failed

version_tested (string)

Version of the product or build under test

executed_at (string)

ISO timestamp of the execution, for example "2026-04-05T14:30:00Z"

attachments (string)

Comma-separated list of attachment names or URLs (screenshots, logs, traces)

comment (string)

Notes or commentary about this result

Inherited from BaseNode (6 fields)

id (string, required)

Unique identifier (UUID)

type (NodeType, required)

Discriminator for the entity type

title (string, required)

Display name

description (string)

Optional detailed description

status (string)

Lifecycle status

tags (string[])

Freeform tags for filtering
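Taken together, the two field lists above could be expressed as TypeScript interfaces along these lines. This is a sketch rather than the graph's actual type definitions: `NodeType` is simplified to `string`, the type-specific fields are assumed optional because the page marks only `id`, `type`, and `title` as required, and `attachmentList` is a hypothetical helper. The example values come from the preview card on this page; the UUID is a placeholder.

```typescript
type ResultStatus = "passed" | "failed" | "timed_out" | "skipped" | "interrupted";

// Fields inherited from BaseNode. NodeType is simplified to string in this sketch.
interface BaseNode {
  id: string;
  type: string;
  title: string;
  description?: string;
  status?: string;
  tags?: string[];
}

// Assumption: type-specific fields are optional.
interface TestResult extends BaseNode {
  type: "test_result";
  result_status?: ResultStatus;
  duration_ms?: number;
  retry_index?: number;    // 0 = first attempt, 1 = first retry
  error_message?: string;
  version_tested?: string;
  executed_at?: string;    // ISO timestamp
  attachments?: string;    // comma-separated names or URLs
  comment?: string;
}

// Hypothetical helper: split the comma-separated attachments field into a list.
function attachmentList(result: TestResult): string[] {
  return (result.attachments ?? "")
    .split(",")
    .map((item) => item.trim())
    .filter((item) => item.length > 0);
}

const example: TestResult = {
  id: "00000000-0000-0000-0000-000000000000", // placeholder UUID
  type: "test_result",
  title: "Safe Change preview render: passed",
  result_status: "passed",
  duration_ms: 1420,
  retry_index: 0,
  version_tested: "v2.4.0-rc.1",
  executed_at: "2026-06-10T09:14:22Z",
  attachments: "safe-change-preview-screenshot.png, trace.json",
};

console.log(attachmentList(example)); // [ 'safe-change-preview-screenshot.png', 'trace.json' ]
```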

Relationships

3 edge types connected to this entity.

Parents

Entities that can contain this type

Test Suite: test_suite_produces_test_result

Test Case: test_case_produces_test_result

Cross-References

Contextual links across the graph

Bug: test_result_reports_bug

Graph Position

2 parents, 1 cross-ref

Definition

A test result records the outcome of running a test case (passed, failed, skipped, or errored) against a specific build, together with the evidence behind it. It is produced by a test case running in a test environment and rolls up into the test suite and coverage report.

Usage Guidance

Record one result per execution, not per test definition, so history is preserved across runs.
Always attach the build or commit, the environment, and the failure evidence (stack trace, diff, screenshot) so a red result is debuggable without re-running.
Link failing results to the investigation, fix, or root cause they triggered.
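The three points above can be sketched as an append-only log. This is an illustration under assumptions: `Execution` and `ResultLog` are hypothetical names, the field set is trimmed to what the guidance mentions, and `build-4822` is an invented follow-up build. The failing entry reuses build #4821 and the assertion values from the checkout example on this page.

```typescript
// Hypothetical record of one execution, carrying the context the guidance asks for.
interface Execution {
  caseId: string;
  commit: string;       // build or commit the run was made against
  environment: string;
  status: "passed" | "failed";
  evidence?: string;    // stack trace, diff, or screenshot reference
  linkedFix?: string;   // investigation, fix, or root cause
}

class ResultLog {
  private readonly entries: Execution[] = [];

  // Append a new entry for every execution; earlier entries are never overwritten.
  record(execution: Execution): void {
    if (execution.status === "failed" && !execution.evidence) {
      throw new Error("a failed result needs evidence to be debuggable");
    }
    this.entries.push(execution);
  }

  // Full history for one test case, oldest first.
  historyFor(caseId: string): Execution[] {
    return this.entries.filter((e) => e.caseId === caseId);
  }
}

const resultLog = new ResultLog();
resultLog.record({
  caseId: "loyalty-discount",
  commit: "build-4821",
  environment: "staging",
  status: "failed",
  evidence: "expected £45.00, received £50.00",
});
resultLog.record({
  caseId: "loyalty-discount",
  commit: "build-4822", // hypothetical build containing the fix
  environment: "staging",
  status: "passed",
});

console.log(resultLog.historyFor("loyalty-discount").length); // 2
```

Rejecting a failed result that has no evidence is one possible way to enforce the second point; a team might prefer a warning instead of an exception.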

Anti-Patterns

The most common misuse is treating a result as the test itself, overwriting the previous outcome on each run so no execution history survives, which makes flakiness and regression invisible.
Recording only pass/fail without the build, environment, and failure evidence produces a red signal nobody can debug, so the result gets ignored or re-run blindly.
Teams also let a single result stand in for many environments, hiding the cases where a test passes locally but fails in CI.
Finally, results that are never linked back to a fix or root cause become noise: a wall of red that the team learns to scroll past rather than act on.

Examples

Failed assertion on a checkout regression

Test case 'applies 10% loyalty discount at checkout' fails on build #4821 in the staging environment. The result records the assertion (expected £45.00, received £50.00), a stack trace, and a link to the fix that corrected the rounding logic.

Flaky integration test flagged across runs

An end-to-end login test passes on 17 of 20 nightly runs against the same commit. The aggregated results expose a 15% flake rate, prompting a root cause investigation into a race condition in the auth callback.
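The 15% figure in the second example can be reproduced from raw runs. The sketch below assumes a simple definition: among runs against one commit, the flake rate is the share held by the minority outcome, and a commit with a consistent verdict has a rate of zero. The `Run` shape, the commit id, and the positions of the three failures are invented for the illustration.

```typescript
// Hypothetical minimal record of one run.
interface Run {
  commit: string;
  passed: boolean;
}

// Flake rate for one commit: share of runs that disagree with the majority verdict.
// Mixed outcomes on the same commit cannot be explained by a code change.
function flakeRate(runs: Run[], commit: string): number {
  const sameCommit = runs.filter((r) => r.commit === commit);
  const passes = sameCommit.filter((r) => r.passed).length;
  const failures = sameCommit.length - passes;
  if (passes === 0 || failures === 0) return 0; // consistent verdict, not flaky
  return Math.min(passes, failures) / sameCommit.length;
}

// 20 nightly runs against the same commit, 3 of which failed.
const nightlyRuns: Run[] = [];
for (let i = 0; i < 20; i++) {
  nightlyRuns.push({ commit: "abc123", passed: !(i === 4 || i === 11 || i === 17) });
}

console.log(flakeRate(nightlyRuns, "abc123")); // 0.15
```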
