What is an AI Dataset?

An AI dataset is a versioned, governed collection of examples used to train, fine-tune, or evaluate an AI model, with its own provenance and licensing.

What is the purpose of an AI Dataset?

An AI dataset sets the ceiling on what a model can learn, so a problem in the data becomes a problem in the model that is hard to diagnose afterwards. Tracking it as a first-class asset, with provenance, licensing, and a quality record, makes contaminated splits or unlicensed sources visible before they ship and keeps results reproducible.

How do you use an AI Dataset in product management?

Version every dataset and freeze the split you evaluate against, so a score today means the same thing as a score next quarter. Record provenance, size, and any labelling or filtering steps. Keep evaluation sets strictly separate from training data to avoid contamination.

What are common mistakes with an AI Dataset?

Leaking examples between training and evaluation splits inflates measured performance and hides a model that won't generalise. Using a dataset with no versioning or provenance makes a regression impossible to trace to the data that caused it. Treating a one-off snapshot as permanently representative ignores distribution drift, so the model is quietly evaluated against a world that no longer matches production.

AI Dataset

A versioned collection of examples for training, fine-tuning, or evaluating a model.

AI & Machine Learning | Engineering & Platform | type: 'ai_dataset' | interface: BaseNode

Description

An AI dataset is the collection of examples a model is trained, fine-tuned, or evaluated on, treated as a governed asset with its own provenance, licensing, and quality record. The data sets the ceiling on what a model can learn, so a problem in the dataset (a contaminated test split, an unlicensed image corpus) becomes a problem in the model that is hard to diagnose after the fact. Tracking the dataset as a first-class thing makes those problems visible before they ship.

Empirical support for that claim comes from Sambasivan et al.'s study of 53 AI practitioners across India, East and West African countries, and the United States, which found a 92% prevalence of "data cascades": compounding downstream model failures triggered by an upstream data-quality problem that went unaddressed. The phrase that opens the paper's title, "everyone wants to do the model work, not the data work," names the organisational pattern that lets cascades form. Data issues are deprioritised until they surface as inexplicable model behaviour, by which point they are difficult and costly to unwind.

Origin & evolution

The case for documenting datasets as governed assets was made in Datasheets for Datasets, proposed by Timnit Gebru and co-authors in 2018. Borrowing from the datasheets that accompany electronic components, they argued every dataset should ship with a record of its motivation, composition, collection process, and recommended uses. The paper was later published in Communications of the ACM in 2021, and the format now anchors dataset documentation across major model providers.

Three concerns have sharpened since. Provenance asks where the data came from and whether consent or licence covers its use. Contamination asks whether evaluation data has leaked into training, which inflates benchmark scores without improving real performance. Licensing has moved from a footnote to a live legal question as text and image corpora face copyright scrutiny.

How it works in practice

A team fine-tunes a classifier on 40,000 labelled support transcripts. Before training, they record the dataset's source (their own production logs), its licence (internal use only), and its date range. During evaluation, they discover that 600 transcripts in the test split also appear in training, so the reported accuracy is overstated. They deduplicate, re-split, and re-run. The honest score drops two points, and the datasheet now documents the contamination check so the next team does not repeat the mistake.
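The contamination check in this scenario can be sketched as follows. This is a minimal illustration, not the team's actual tooling: the function names and the toy transcripts are hypothetical, and it assumes that "the same transcript" means an exact match after lowercasing and collapsing whitespace. Near-duplicates (paraphrases, truncated copies) would need a fuzzier comparison.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a transcript after light normalisation, so casing or
    whitespace differences do not hide a duplicate."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_leaks(train: list[str], test: list[str]) -> list[int]:
    """Return indices of test examples whose fingerprint also occurs in train."""
    train_prints = {fingerprint(t) for t in train}
    return [i for i, t in enumerate(test) if fingerprint(t) in train_prints]


def drop_leaks(train: list[str], test: list[str]) -> list[str]:
    """Keep only the test examples that do not appear in train."""
    leaked = set(find_leaks(train, test))
    return [t for i, t in enumerate(test) if i not in leaked]


# Toy data standing in for the support transcripts.
train = ["My card was charged twice", "How do I reset my password?"]
test = ["my card was  charged twice", "Where is my invoice?"]

leaks = find_leaks(train, test)        # indices of contaminated test rows
clean_test = drop_leaks(train, test)   # test split with the leaks removed
```

Recording the number of leaks found, and the normalisation rule used to find them, is the kind of detail the datasheet entry would capture.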

AI Dataset vs. its neighbours

AI Model is the trained output; the dataset is the input. The model inherits the dataset's biases and gaps, which is why the link between them is tracked rather than assumed.
Data Source is the upstream system the data was pulled from, such as a CRM or a log store. A dataset is a curated, versioned snapshot derived from one or more sources.
Data Classification labels how sensitive the contents are. A dataset carries a classification, and that label drives who may use it for training and where it may be stored.

In the graph

In the Unified Product Graph, an AI dataset sits in the AI and intelligence region as the governed input to training. Its primary edge is ai_model_trained_on_ai_dataset, which records exactly which corpus produced which model. That single link supports the questions governance teams actually ask: if a dataset turns out to be unlicensed, which models are affected, and which products depend on those models.
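The governance question above amounts to a two-hop traversal, sketched here over a plain list of edges. Only the edge type ai_model_trained_on_ai_dataset comes from this page. The node ids, the product_uses_ai_model edge type, and the in-memory edge list are assumptions made for illustration, and a real graph store would expose its own query interface.

```python
# Edges as (source_id, edge_type, target_id). The ids and the
# "product_uses_ai_model" edge type are hypothetical.
EDGES = [
    ("model:intent-v3", "ai_model_trained_on_ai_dataset", "dataset:support-intents"),
    ("model:summariser-v1", "ai_model_trained_on_ai_dataset", "dataset:news-corpus"),
    ("product:helpdesk", "product_uses_ai_model", "model:intent-v3"),
    ("product:inbox", "product_uses_ai_model", "model:intent-v3"),
]


def sources_of(edges, edge_type, target_id):
    """Ids of nodes that point at target_id through edge_type, sorted."""
    return sorted(s for s, e, t in edges if e == edge_type and t == target_id)


def impact_of_dataset(edges, dataset_id):
    """Map each model trained on dataset_id to the products that use it."""
    models = sources_of(edges, "ai_model_trained_on_ai_dataset", dataset_id)
    return {m: sources_of(edges, "product_uses_ai_model", m) for m in models}


impact = impact_of_dataset(EDGES, "dataset:support-intents")
```

For the toy edges, the result lists one affected model and the two products that depend on it. A dataset with no recorded models yields an empty mapping.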

Preview

AI Dataset

Trellis Safe Change approval evaluation dataset

Dataset type: evaluation
Provenance: human_labelled
Version: 1.3.0
Record count: 2400
Format: jsonl
Storage URI: s3://trellis-ml/datasets/safe-change-eval-v1.3.jsonl
Checksum: sha256:a7f3c29d1e84b05c
License: proprietary
Tags: safe-change, approval, governance
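One way the checksum field might be used is to confirm that a local copy of the dataset still matches what was recorded. The sketch below assumes the field stores a full SHA-256 digest in the form "sha256:<hex>". The preview value has only 16 hex characters, whereas a full SHA-256 digest has 64, so it appears to be abbreviated. The function names and the throwaway demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return 'sha256:<hex digest>' for the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def matches_record(path: str, recorded: str) -> bool:
    """True if the file on disk still hashes to the recorded checksum."""
    return file_checksum(path) == recorded.strip().lower()


# Demo on a throwaway file standing in for a downloaded dataset copy.
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
    tmp.write(b'{"id": 1, "label": "approve"}\n')
    demo_path = tmp.name

recorded = file_checksum(demo_path)              # stored when the version is published
unchanged = matches_record(demo_path, recorded)  # file untouched so far

with open(demo_path, "ab") as handle:            # simulate a silent modification
    handle.write(b'{"id": 2, "label": "reject"}\n')
modified = matches_record(demo_path, recorded)

os.remove(demo_path)
```

Reading in chunks keeps memory use flat for large files. The comparison fails as soon as a single byte changes, which is what makes the hash useful for pinning a frozen evaluation set.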

Properties

Type-specific fields on BaseNode

dataset_type (enum): Purpose. Values: training, evaluation, fine_tuning, rlhf, synthetic
version (string): Version
record_count (number): Records
format (string): Format (e.g. "jsonl", "csv", "parquet")
storage_uri (string): Storage URI
checksum (string): Integrity hash
provenance (enum): Origin. Values: human_labelled, synthetic, scraped, converted, mixed
license (string): SPDX license identifier
tags (string[]): Free-form classification tags

Inherited from BaseNode (6 fields)

id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
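The fields above could be mirrored in code roughly as follows. This is an illustrative Python rendering, not the graph's own type definition. The field names and enum values are taken from the property list, but the class name, the choice to treat every type-specific field as optional, and the placeholder id are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional, get_args

DatasetType = Literal["training", "evaluation", "fine_tuning", "rlhf", "synthetic"]
Provenance = Literal["human_labelled", "synthetic", "scraped", "converted", "mixed"]


@dataclass
class AIDataset:
    # Inherited from BaseNode: id, type, and title are marked required.
    id: str
    title: str
    type: str = "ai_dataset"
    description: Optional[str] = None
    status: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    # Type-specific fields. Treating them as optional is an assumption;
    # the property list does not say which are mandatory.
    dataset_type: Optional[DatasetType] = None
    version: Optional[str] = None
    record_count: Optional[int] = None
    format: Optional[str] = None
    storage_uri: Optional[str] = None
    checksum: Optional[str] = None
    provenance: Optional[Provenance] = None
    license: Optional[str] = None

    def __post_init__(self) -> None:
        # Literal types are not enforced at runtime, so check enums by hand.
        if self.dataset_type is not None and self.dataset_type not in get_args(DatasetType):
            raise ValueError(f"unknown dataset_type: {self.dataset_type!r}")
        if self.provenance is not None and self.provenance not in get_args(Provenance):
            raise ValueError(f"unknown provenance: {self.provenance!r}")
        if self.record_count is not None and self.record_count < 0:
            raise ValueError("record_count must not be negative")


# The record shown in the preview, with a placeholder UUID.
example = AIDataset(
    id="00000000-0000-0000-0000-000000000000",
    title="Trellis Safe Change approval evaluation dataset",
    dataset_type="evaluation",
    version="1.3.0",
    record_count=2400,
    format="jsonl",
    storage_uri="s3://trellis-ml/datasets/safe-change-eval-v1.3.jsonl",
    checksum="sha256:a7f3c29d1e84b05c",
    provenance="human_labelled",
    license="proprietary",
    tags=["safe-change", "approval", "governance"],
)
```

Validating the two enums at construction time means a misspelt value such as "benchmark" fails when the record is created, not later in a report.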

Lifecycle

4 phases, initial: collecting

Relationships

2 edge types connected to this entity.

Parents

Entities that can contain this type

AI Model: ai_model_trained_on_ai_dataset

Cross-References

Contextual links across the graph

Data Source: ai_dataset_sourced_from_data_source

Graph Position

1 parent

AI Dataset

1 cross-ref

Definition

An AI dataset is a versioned, governed collection of examples used to train, fine-tune, or evaluate a model, with its own provenance and licensing. It feeds eval benchmarks and experiments, so a result can be traced back to the data behind it.

Usage Guidance

Version every dataset and freeze the split you evaluate against, so a score today means the same thing as a score next quarter.
Record provenance, size, and any labelling or filtering steps.
Keep evaluation sets strictly separate from training data to avoid contamination.
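One possible way to keep a split stable across versions is to derive each example's split from a hash of its id, so membership never depends on row order or on a random seed that might change. The sketch below is an assumption-laden illustration: the function name, the salt parameter, the split names, and the ticket ids are hypothetical, and the resulting proportions are only approximately 80/10/10.

```python
import hashlib


def assign_split(example_id: str, salt: str = "split-v1",
                 eval_pct: int = 10, test_pct: int = 10) -> str:
    """Map an example id to 'train', 'eval', or 'test' deterministically."""
    digest = hashlib.sha256(f"{salt}:{example_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + eval_pct:
        return "eval"
    return "train"


ids = [f"ticket-{n}" for n in range(1000)]
first = {i: assign_split(i) for i in ids}

# Adding new examples in a later version does not move existing ones.
second = {i: assign_split(i) for i in ids + ["ticket-new-1", "ticket-new-2"]}
```

Because the assignment depends only on the id and the salt, an example that was in the training split in one version cannot drift into the evaluation split in the next. Changing the salt deliberately creates a new split, which should then be recorded as a new version.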

Anti-Patterns

Leaking examples between training and evaluation splits inflates measured performance and hides a model that won't generalise.
Using a dataset with no versioning or provenance makes a regression impossible to trace to the data that caused it.
Treating a one-off snapshot as permanently representative ignores distribution drift, so the model is quietly evaluated against a world that no longer matches production.
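The drift anti-pattern can be made measurable by comparing the label mix of the frozen snapshot with a recent production sample. The sketch below uses total variation distance as one simple choice among several. The function names and toy labels are hypothetical, and the threshold that should trigger a refresh is left open because it depends on the product.

```python
from collections import Counter


def label_shares(labels: list[str]) -> dict[str, float]:
    """Fraction of examples carrying each label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(a: list[str], b: list[str]) -> float:
    """Half the L1 distance between two label distributions.
    0 means identical mixes, 1 means no labels in common."""
    pa, pb = label_shares(a), label_shares(b)
    return 0.5 * sum(abs(pa.get(k, 0.0) - pb.get(k, 0.0)) for k in set(pa) | set(pb))


# Toy data: the snapshot predates a new intent that now appears in production.
snapshot = ["refund"] * 50 + ["login"] * 50
production = ["refund"] * 30 + ["login"] * 40 + ["billing_dispute"] * 30

drift = total_variation(snapshot, production)  # about 0.3 for this toy data
```

A label that exists in production but not in the snapshot contributes its whole share to the distance, which is the situation a new version with added examples is meant to fix.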

Examples

Support-intent classification set

12,000 anonymised support tickets, each labelled with one of 9 intents, split 80/10/10. Version v3 added 1,400 examples for a new billing-dispute intent.
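As a worked check of the split arithmetic, the sketch below assumes that 12,000 is the total being split and that 80/10/10 follows the conventional train, evaluation, test order. The text states neither explicitly, and the helper name is hypothetical.

```python
def split_sizes(total: int, ratios: tuple[int, ...]) -> list[int]:
    """Split total into integer parts proportional to ratios. Any remainder
    left by rounding down is added to the first part."""
    weight = sum(ratios)
    sizes = [total * r // weight for r in ratios]
    sizes[0] += total - sum(sizes)
    return sizes


train_n, eval_n, test_n = split_sizes(12_000, (80, 10, 10))
# 9,600 training, 1,200 evaluation, and 1,200 test examples
```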

Golden eval set for summarisation

200 documents with human-written reference summaries, frozen at v1 and never used for training, so every model can be scored on the same fixed exam.
