A structured trial that changes one variable in an AI system (a model, a prompt, a parameter) and measures the effect on quality, cost, or latency.
An AI experiment is a structured comparison: one or more models, prompts, or configurations run against a fixed dataset and scored on agreed metrics. It is how a team answers "is this version actually better?" with evidence instead of a hunch from three lucky demos.
The discipline is offline evaluation, imported from machine learning and adapted to the open-ended outputs of language models. Where a classifier could be scored against labelled ground truth, generated prose resists exact-match grading. The field's answer was LLM-as-a-judge: use a strong model to score another model's output against a reference or a rubric. Evidently AI and LangSmith both describe the now-standard loop, where an experiment runs a candidate system over a curated dataset and a judge or metric produces comparable scores. Mature practice keeps a held-out slice of manually labelled examples to validate the judge itself, since an unchecked judge carries its own biases into every score it produces.
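The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the API of Evidently AI or LangSmith: `run_experiment`, `judge_agreement`, `exact_match_judge`, and the dictionary keys (`input`, `reference`, `output`, `human_score`) are all hypothetical names, and the judge is a plain callable standing in for a call to a strong model.

```python
from statistics import mean
from typing import Any, Callable, Dict, List

# Hypothetical example shape: {"input": ..., "reference": ...}
Example = Dict[str, Any]
System = Callable[[str], str]        # candidate system under test
Judge = Callable[[str, str], float]  # (output, reference) -> score


def run_experiment(dataset: List[Example], system: System, judge: Judge) -> Dict[str, Any]:
    """Run the candidate over every example and collect comparable scores."""
    scores = []
    for example in dataset:
        output = system(example["input"])
        scores.append(judge(output, example["reference"]))
    return {"n": len(scores), "mean_score": mean(scores), "scores": scores}


def judge_agreement(held_out: List[Example], judge: Judge, tolerance: float = 0.0) -> float:
    """Fraction of manually labelled examples where the judge matches the human score.

    Each held-out example is assumed to carry a stored "output", its
    "reference", and a "human_score" assigned by a person.
    """
    hits = 0
    for example in held_out:
        judged = judge(example["output"], example["reference"])
        if abs(judged - example["human_score"]) <= tolerance:
            hits += 1
    return hits / len(held_out)


def exact_match_judge(output: str, reference: str) -> float:
    """Stand-in judge for illustration only; a real judge would call a model."""
    return 1.0 if output.strip() == reference.strip() else 0.0
```

A low `judge_agreement` on the held-out slice is a signal to revise the rubric or the judge before trusting any experiment scored with it.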
A team wants to swap a prompt that is costing too many tokens. They assemble a dataset of 200 real support questions with hand-written ideal answers. They run two configurations: the live prompt and a leaner candidate. An LLM judge scores each answer for accuracy and helpfulness; a script measures token use. The candidate scores 4.4 against the live prompt's 4.5 on quality, a difference inside the noise, while cutting tokens by 38%. The experiment makes the trade-off legible, and the team ships the cheaper prompt with confidence rather than crossing their fingers in production.
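One way to check whether a quality gap like 4.4 versus 4.5 is "inside the noise" is a paired bootstrap over per-example judge scores. The sketch below is an assumption about how such a check might look, not the method used in the example above: `paired_bootstrap_ci` and `token_reduction` are hypothetical helpers, and the resample count and seed are arbitrary defaults.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_bootstrap_ci(
    baseline: List[float],
    candidate: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean_diff, low, high) for candidate minus baseline.

    Scores must be paired: index i in both lists is the same dataset example.
    """
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must score the same examples")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # seeded so reruns give the same interval
    n = len(diffs)
    # Resample the per-example differences with replacement and record each mean.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    low = resampled[int((alpha / 2) * n_resamples)]
    high = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), low, high


def token_reduction(baseline_tokens: List[int], candidate_tokens: List[int]) -> float:
    """Fractional drop in total tokens, e.g. 0.38 for a 38% cut."""
    return 1 - sum(candidate_tokens) / sum(baseline_tokens)
```

If the interval returned by `paired_bootstrap_ci` straddles zero, the data do not distinguish the two prompts on quality, and the token saving can reasonably decide the outcome.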
In the Unified Product Graph, an AI Experiment lives in the AI region as the evaluation primitive. The model under test connects through ai_model_evaluated_through_ai_experiment (AI Model evaluated through AI Experiment, a hierarchy edge), and any models the experiment invokes (a judge, a baseline) connect through ai_experiment_uses_ai_model (AI Experiment uses AI Model, a cross-domain edge). Keeping the experiment as its own node means a model's quality claims point at the evidence that backs them, so "we improved accuracy" is always traceable to the run that demonstrated it.
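The traceability described above can be illustrated with a tiny in-memory edge list. This is a sketch only: the two edge type identifiers come from this page, but the `Edge` tuple, the helper functions, and the edge direction (source to target following the order of names in the identifier) are assumptions, not the Unified Product Graph's actual storage model.

```python
from typing import List, NamedTuple

# Edge type identifiers as listed for this entity.
EVALUATED_THROUGH = "ai_model_evaluated_through_ai_experiment"
USES = "ai_experiment_uses_ai_model"


class Edge(NamedTuple):
    """Hypothetical edge record: source node id, edge type, target node id."""
    source: str
    edge_type: str
    target: str


def experiments_backing(model_id: str, edges: List[Edge]) -> List[str]:
    """Experiments that evaluated the given model (the evidence for its claims)."""
    return [
        e.target
        for e in edges
        if e.edge_type == EVALUATED_THROUGH and e.source == model_id
    ]


def models_used_by(experiment_id: str, edges: List[Edge]) -> List[str]:
    """Models the experiment invoked, such as a judge or a baseline."""
    return [
        e.target
        for e in edges
        if e.edge_type == USES and e.source == experiment_id
    ]
```

With edges stored this way, a claim about a model resolves to the experiment ids returned by `experiments_backing`, and each experiment in turn lists its judge and baseline through `models_used_by`.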
Type-specific fields on BaseNode
project (string): Parent project or experiment group
run_name (string): Human-readable run name
config (string): Serialised hyperparameters and config
summary_metrics (string): Key-metric summary
started_at (string): ISO timestamp started
completed_at (string): ISO timestamp completed
foundation_model (string): Foundation starting model
training_steps (number): Training steps or epochs completed
artifact_uri (string): Produced artifact URI
notes (string): Free-text notes
tags (string[]): Free-form classification tags

id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
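For readers who want the field list as a type, the sketch below mirrors it with Python dataclasses. The class names, the mapping of `number` to `int`, the use of `None` for unset optional fields, and the `"ai_experiment"` discriminator value in the usage note are assumptions for illustration; the page does not specify them. The `tags` field appears in both lists above and is declared once here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BaseNode:
    """Common fields; id, type, and title are marked required in the list above."""
    id: str                      # unique identifier (UUID)
    type: str                    # NodeType discriminator; allowed values are not listed here
    title: str                   # display name
    description: Optional[str] = None
    status: Optional[str] = None  # lifecycle status
    tags: List[str] = field(default_factory=list)


@dataclass
class AIExperiment(BaseNode):
    """Type-specific fields, all optional."""
    project: Optional[str] = None           # parent project or experiment group
    run_name: Optional[str] = None          # human-readable run name
    config: Optional[str] = None            # serialised hyperparameters and config
    summary_metrics: Optional[str] = None   # key-metric summary
    started_at: Optional[str] = None        # ISO timestamp
    completed_at: Optional[str] = None      # ISO timestamp
    foundation_model: Optional[str] = None  # foundation starting model
    training_steps: Optional[int] = None    # steps or epochs completed
    artifact_uri: Optional[str] = None      # produced artifact URI
    notes: Optional[str] = None             # free-text notes
```

A node could then be created with only the required fields, for example `AIExperiment(id="...", type="ai_experiment", title="Lean prompt trial")`, where the `type` value is a placeholder.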
5 phases — initial: planned
2 edge types connected to this entity.
ai_model_evaluated_through_ai_experiment
ai_experiment_uses_ai_model