A curated collection of examples used to train, fine-tune, or evaluate an AI model. Versioned and described so results are reproducible.
An AI dataset is the collection of examples a model is trained, fine-tuned, or evaluated on, treated as a governed asset with its own provenance, licensing, and quality record. The data sets the ceiling on what a model can learn, so a problem in the dataset (a contaminated test split, an unlicensed image corpus) becomes a problem in the model that is hard to diagnose after the fact. Tracking the dataset as a first-class thing makes those problems visible before they ship.
The case for documenting datasets as governed assets was made in Datasheets for Datasets, proposed by Timnit Gebru and co-authors in 2018. Borrowing from the datasheets that accompany electronic components, they argued every dataset should ship with a record of its motivation, composition, collection process, and recommended uses. The paper was later published in Communications of the ACM in 2021, and the format now anchors dataset documentation across major model providers.
Three concerns have sharpened since. Provenance asks where the data came from and whether consent or licence covers its use. Contamination asks whether evaluation data has leaked into training, which inflates benchmark scores without improving real performance. Licensing has moved from a footnote to a live legal question as text and image corpora face copyright scrutiny.
A team fine-tunes a classifier on 40,000 labelled support transcripts. Before training, they record the dataset's source (their own production logs), its licence (internal use only), and its date range. During evaluation, they discover that 600 transcripts in the test split also appear in training, so the reported accuracy is overstated. They deduplicate, re-split, and re-run. The honest score drops two points, and the datasheet now documents the contamination check so the next team does not repeat the mistake.
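The contamination check in this example can be sketched in a few lines. This is a minimal illustration, not the team's actual procedure: it assumes transcripts are plain strings and that "appears in training" means an exact match after lowercasing and collapsing whitespace. The function names and sample transcripts are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a transcript after light normalization (case and whitespace)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaked(train: list[str], test: list[str]) -> list[int]:
    """Return indices of test examples whose fingerprint also occurs in train."""
    train_prints = {fingerprint(t) for t in train}
    return [i for i, t in enumerate(test) if fingerprint(t) in train_prints]


def drop_leaked(train: list[str], test: list[str]) -> list[str]:
    """Return the test split with leaked examples removed."""
    leaked = set(find_leaked(train, test))
    return [t for i, t in enumerate(test) if i not in leaked]


# Hypothetical sample data: the first test item duplicates a training item
# apart from case and spacing.
train = ["My order never arrived.", "How do I reset my password?"]
test = ["how do I reset  my password?", "Can I change my billing date?"]

leaked = find_leaked(train, test)
clean_test = drop_leaked(train, test)
```

Exact matching after normalization only catches verbatim duplicates. Paraphrased or lightly edited transcripts would need a fuzzier comparison, which is outside the scope of this sketch.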
In the Unified Product Graph, an AI dataset sits in the AI and intelligence region as the governed input to training. Its primary edge, AI Model trained on AI Dataset (a hierarchy edge with the identifier ai_model_trained_on_ai_dataset), records exactly which corpus produced which model. That single link supports the questions governance teams actually ask: if a dataset turns out to be unlicensed, which models are affected, and which products depend on those models.
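The governance question above amounts to a two-hop traversal of the graph. The sketch below assumes edges are stored as (source, edge type, target) triples. Only ai_model_trained_on_ai_dataset comes from this page; the product_uses_ai_model edge type, the node ids, and the function names are hypothetical stand-ins for whatever the graph actually uses.

```python
from collections import defaultdict

# Edges as (source, edge_type, target) triples. Only
# ai_model_trained_on_ai_dataset appears in the source text; the
# product_uses_ai_model edge type and all node ids are hypothetical.
EDGES = [
    ("model-a", "ai_model_trained_on_ai_dataset", "dataset-1"),
    ("model-b", "ai_model_trained_on_ai_dataset", "dataset-1"),
    ("model-c", "ai_model_trained_on_ai_dataset", "dataset-2"),
    ("product-x", "product_uses_ai_model", "model-a"),
    ("product-y", "product_uses_ai_model", "model-c"),
]


def sources_by_target(edges, edge_type):
    """Index edges of one type as target -> set of sources."""
    index = defaultdict(set)
    for source, kind, target in edges:
        if kind == edge_type:
            index[target].add(source)
    return index


def impact_of_dataset(edges, dataset_id):
    """Return (models, products) that depend on the given dataset."""
    trained_on = sources_by_target(edges, "ai_model_trained_on_ai_dataset")
    used_by = sources_by_target(edges, "product_uses_ai_model")
    models = set(trained_on.get(dataset_id, set()))
    products = set()
    for model in models:
        products |= used_by.get(model, set())
    return sorted(models), sorted(products)


models, products = impact_of_dataset(EDGES, "dataset-1")
```

With the sample edges, an unlicensed dataset-1 affects model-a and model-b, and through model-a it reaches product-x.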
Type-specific fields on BaseNode
dataset_type (string): Purpose
version (string): Version
record_count (number): Records
format (string): Format (e.g. "jsonl", "csv", "parquet")
storage_uri (string): Storage URI
checksum (string): Integrity hash
provenance (string): Origin
license (string): SPDX license identifier
tags (string[]): Free-form classification tags
Base fields from BaseNode
id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
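As a rough illustration of how these fields fit together, the record can be modelled as a Python dataclass. This is a sketch under assumptions, not the graph's actual schema definition: the class name, the "ai_dataset" type value, the dataset_type value, the storage URI, and the LicenseRef-style licence string are all hypothetical. The record count, format, and internal-only licence echo the support-transcript example earlier on this page. The tags field appears in both field lists, so it is modelled once.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AIDataset:
    # Required base fields.
    id: str
    type: str  # NodeType discriminator; "ai_dataset" below is an assumed value
    title: str
    # Optional base fields.
    description: Optional[str] = None
    status: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    # Type-specific fields.
    dataset_type: Optional[str] = None
    version: Optional[str] = None
    record_count: Optional[int] = None
    format: Optional[str] = None
    storage_uri: Optional[str] = None
    checksum: Optional[str] = None
    provenance: Optional[str] = None
    license: Optional[str] = None


def sha256_hex(data: bytes) -> str:
    """Integrity hash of the dataset bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for the real file contents; a real dataset would be hashed
# from storage, ideally in chunks.
payload = b'{"text": "How do I reset my password?", "label": "account"}\n'

dataset = AIDataset(
    id=str(uuid.uuid4()),
    type="ai_dataset",  # assumed discriminator value
    title="Support transcripts",
    status="collecting",
    dataset_type="fine-tuning",  # assumed value
    version="1.0.0",
    record_count=40_000,
    format="jsonl",
    storage_uri="s3://example-bucket/support-transcripts/v1.jsonl",  # hypothetical
    checksum=sha256_hex(payload),
    provenance="Internal production logs",
    license="LicenseRef-Internal-Use-Only",  # assumed custom SPDX-style reference
)
```

Recomputing the hash when the dataset is loaded and comparing it with the stored checksum gives a cheap check that the bytes used for training are the bytes that were documented.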
Lifecycle: 4 phases, starting at collecting.
1 edge type connected to this entity.
ai_model_trained_on_ai_dataset