A structured evaluation comparing multiple AI models or providers on quality, cost, latency, and capability.
A model comparison is a head-to-head evaluation of two or more models on the same task, producing a decision about which to use. The trade-off it has to resolve is three-way: quality, cost, and latency rarely point at the same model, so a comparison is the place a team makes that tension explicit and picks deliberately. A comparison that reports only quality is half a decision.
Comparing systems on shared tasks is old, but the modern reference point for language models is Chatbot Arena, launched by LMSYS in spring 2023. It shows a user two anonymous model responses to their own prompt and asks only which is better. Those pairwise votes feed an Elo rating, borrowed from chess, that predicts the win rate between any two models. The design captures preferences on real, in-the-wild prompts that a fixed benchmark cannot.
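The Elo mechanics described above can be sketched in a few lines. This is a minimal illustration of the standard online Elo update applied to pairwise votes, not Chatbot Arena's actual code; the starting rating of 1000, the K-factor of 32, and the model names and votes are all assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo-predicted probability that the first model beats the second."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one online Elo update in place for a single pairwise vote."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)  # larger gain for an unexpected win
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical votes: each tuple is (preferred model, other model).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Predicted win rate of model_a over model_b after these votes.
predicted = expected_score(ratings["model_a"], ratings["model_b"])
```

Because each vote moves the two ratings by equal and opposite amounts, the result depends on the order in which votes arrive, which is one reason an online update can be noisy on small samples.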
The method has since been refined for statistical honesty. LMSYS moved from the online Elo update to the Bradley-Terry model, which yields more stable ratings and proper confidence intervals, so a small rating gap is no longer mistaken for a real difference. The same lineage runs through the LLM-as-a-judge work in Zheng et al. (2023), which validated automated pairwise judging against human preference and named the position and verbosity biases a naive comparison inherits.
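For contrast with an online update, a Bradley-Terry fit uses all votes at once, so the result does not depend on vote order. The sketch below uses the classic minorization-maximization iteration for the Bradley-Terry likelihood; it is an illustration under assumed data, not the LMSYS implementation, and the function name and win counts are hypothetical. Confidence intervals are commonly obtained by refitting on bootstrap resamples of the votes, which is omitted here.

```python
def fit_bradley_terry(wins: dict, iterations: int = 200) -> dict:
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] is the number of times a was preferred over b.
    Returns strengths normalized to sum to 1.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 / len(models) for m in models}
    total_wins = {m: 0.0 for m in models}
    games = {}  # unordered pair -> number of comparisons between the pair
    for (winner, loser), count in wins.items():
        total_wins[winner] += count
        key = frozenset((winner, loser))
        games[key] = games.get(key, 0.0) + count

    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                n = games.get(frozenset((m, other)), 0.0)
                if other != m and n > 0:
                    denom += n / (strength[m] + strength[other])
            updated[m] = total_wins[m] / denom if denom > 0 else strength[m]
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}
    return strength


def win_probability(strength: dict, a: str, b: str) -> float:
    """Bradley-Terry probability that a is preferred over b."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical vote counts between three candidates.
wins = {
    ("a", "b"): 7, ("b", "a"): 3,
    ("b", "c"): 6, ("c", "b"): 4,
    ("a", "c"): 8, ("c", "a"): 2,
}
strength = fit_bradley_terry(wins)
```

A model with no wins at all ends up with strength zero in this sketch; a production fit would usually add regularization or drop such models.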
The practical lesson is that a public leaderboard ranks models in general, while a product needs the best model for one use case. An arena win tells you which model people prefer on average; it does not tell you which model resolves your refund tickets within your latency budget at a price you can sustain. Teams increasingly run their own comparisons on their own tasks for exactly this reason.
A team choosing a model for real-time chat support runs three candidates against the same 200-ticket eval. Model A scores highest on resolution quality but answers in 3.1 seconds at the highest per-token price. Model B scores two points lower, answers in 0.9 seconds, and costs a third as much. Model C trails on quality. For a live chat where users wait on each reply, the team picks B: users will not notice a two-point quality gap, but they will certainly notice a three-second pause, so the latency and cost advantages outweigh the difference in quality. The comparison made the trade-off arguable on evidence, and the winner is recorded against the reasons it won.
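One way to make a choice like this repeatable is to write the decision rule down as code. The sketch below is an assumed rule, not the team's actual method: drop candidates over a latency budget, keep those within a quality tolerance of the best remaining score, then take the cheapest. The 3.1 s and 0.9 s latencies, the two-point gap, and the one-third cost ratio come from the example; every other number (the absolute quality scores, Model C's latency and cost, the 2.0 s budget, the 3-point tolerance) is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float        # eval score, higher is better (hypothetical scale)
    latency_s: float      # seconds per reply
    relative_cost: float  # cost relative to the most expensive candidate


def pick_model(candidates, max_latency_s: float, quality_tolerance: float) -> Candidate:
    """Cheapest candidate within the latency budget and near the best quality."""
    eligible = [c for c in candidates if c.latency_s <= max_latency_s]
    if not eligible:
        raise ValueError("no candidate meets the latency budget")
    best_quality = max(c.quality for c in eligible)
    close_enough = [
        c for c in eligible if best_quality - c.quality <= quality_tolerance
    ]
    return min(close_enough, key=lambda c: c.relative_cost)


candidates = [
    Candidate("A", quality=88.0, latency_s=3.1, relative_cost=1.0),
    Candidate("B", quality=86.0, latency_s=0.9, relative_cost=1.0 / 3.0),
    Candidate("C", quality=80.0, latency_s=1.2, relative_cost=0.5),  # assumed figures
]
choice = pick_model(candidates, max_latency_s=2.0, quality_tolerance=3.0)
```

Recording the budget and tolerance next to the winner keeps the reasons for the choice inspectable when prices or latencies change.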
In the Unified Product Graph, Model Comparison (model_comparison, a comparison between AI models) lives in the ai domain (AI & Machine Learning) and closes the Model Evaluation Loop. Candidates connect through ai_model_compared_in_model_comparison (AI Model compared in Model Comparison, hierarchy), the chosen model is named by model_comparison_winner_is_ai_model (Model Comparison won by AI Model, cross-domain), the product that ran the comparison links via product_compared_via_model_comparison (Product compared via Model Comparison, hierarchy), and the choice flows outward through model_comparison_informs_decision (Model Comparison informs Decision, cross-domain). Capturing the comparison as its own entity, with an explicit winner edge and a decision edge, is what keeps a model choice from becoming folklore: the graph holds which models were weighed, which one won, and the decision that choice fed.
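The four edge types can be read as (source, label, target) triples. The sketch below lists them as plain data, pairing each edge identifier with the entities its own name spells out; the tuple layout and the helper function are assumptions for illustration, not the graph's actual storage format or API.

```python
# (edge id, source entity, label, target entity, kind)
EDGE_TYPES = [
    ("ai_model_compared_in_model_comparison",
     "AI Model", "compared in", "Model Comparison", "hierarchy"),
    ("model_comparison_winner_is_ai_model",
     "Model Comparison", "won by", "AI Model", "cross-domain"),
    ("product_compared_via_model_comparison",
     "Product", "compared via", "Model Comparison", "hierarchy"),
    ("model_comparison_informs_decision",
     "Model Comparison", "informs", "Decision", "cross-domain"),
]


def edges_touching(entity: str) -> list:
    """Return the ids of edge types that start or end at the given entity."""
    return [
        edge_id
        for edge_id, source, _label, target, _kind in EDGE_TYPES
        if entity in (source, target)
    ]


outgoing = [e[0] for e in EDGE_TYPES if e[1] == "Model Comparison"]
```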
Type-specific fields on BaseNode
model_ids (string[]): Compared model identifiers
comparison_criteria (string[]): Comparison dimensions (e.g. "accuracy", "cost", "latency")
comparison_date (string): ISO date on which the comparison was conducted
id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
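As a rough picture of what one such record might look like, the sketch below mirrors the listed fields in a Python dataclass. The class name, the type value "model_comparison", the example title, model identifiers, and date are assumptions for illustration; only the field names, their types, and which fields are required come from the list above.

```python
import uuid
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ModelComparisonNode:
    # Required fields.
    id: str
    type: str
    title: str
    # Optional common fields.
    description: Optional[str] = None
    status: Optional[str] = None
    tags: list = field(default_factory=list)
    # Type-specific fields.
    model_ids: list = field(default_factory=list)
    comparison_criteria: list = field(default_factory=list)
    comparison_date: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject dates that are not in ISO format (YYYY-MM-DD).
        if self.comparison_date is not None:
            date.fromisoformat(self.comparison_date)


node = ModelComparisonNode(
    id=str(uuid.uuid4()),
    type="model_comparison",           # assumed discriminator value
    title="Chat support model choice",  # hypothetical example
    status="planned",
    model_ids=["model-a", "model-b", "model-c"],
    comparison_criteria=["accuracy", "cost", "latency"],
    comparison_date="2025-01-15",
)
```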
4 lifecycle phases; initial phase: planned
4 edge types connected to this entity.