EvalCards Schema Exposes Systematic AI Benchmark Metadata Gaps

A consortium of 48 authors from Hugging Face, Stanford, EleutherAI, and more than two dozen other institutions has introduced EvalCards, a structured reporting schema for AI evaluation. The schema was derived from a structured review of 52 papers and 10 stakeholder interviews. A monitoring tool then applied it to 101,843 benchmark results across 5,816 models and 635 benchmarks, surfacing systematic gaps: published scores often lack the metadata needed for fair comparison.

EvalCards composes benchmark metadata, evaluation run parameters, and model metadata into a unified record built around four interpretive signals: reproducibility, documentation completeness, provenance and risk, and score comparability. It includes reader modes for research and non-research audiences, so an architect comparing MMLU submissions can read hyperparameters, prompt formatting, few-shot count, and harness version from one record instead of piecing them together from multiple sources. The aim is to replace the current interpretive burden with machine-readable provenance that travels with the score.
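As a rough illustration of what a unified record with a documentation-completeness signal could look like, here is a minimal Python sketch. The class names, field names, scoring rule, and sample values are all assumptions made for this example; they are not taken from the EvalCards schema itself.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RunMetadata:
    # Run-level fields named in the article. All are optional here
    # because they are often unreported. Field names are hypothetical.
    hyperparameters: Optional[dict] = None
    prompt_format: Optional[str] = None
    few_shot_count: Optional[int] = None
    harness_version: Optional[str] = None


@dataclass
class EvalRecord:
    # Hypothetical unified record: model, benchmark, score, run metadata.
    model: str
    benchmark: str
    score: float
    run: RunMetadata


def documentation_completeness(record: EvalRecord) -> float:
    """Fraction of run-level fields that are actually reported."""
    values = [getattr(record.run, f.name) for f in fields(record.run)]
    reported = sum(value is not None for value in values)
    return reported / len(values)


# Placeholder values for illustration only, not real results.
record = EvalRecord(
    model="example-model",
    benchmark="MMLU",
    score=0.71,
    run=RunMetadata(few_shot_count=5, harness_version="0.4.2"),
)
print(documentation_completeness(record))  # 0.5 (2 of 4 fields reported)
```

A simple fraction like this treats every field as equally important; a real signal would likely weight fields differently.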

The audit confirmed systemic reporting gaps: hyperparameters, prompt templates, and evaluation harness versions are often missing from leaderboards, model cards, and company announcements. That makes cross-vendor score comparison an exercise in false precision. A separate framework, also called EvalCards, from researchers at the University of Copenhagen, ETH Zurich, the University of Amsterdam, the University of Barcelona, and Johannes Kepler University Linz was published in November 2025. It identified three crises (reproducibility, accessibility, and governance) and compared the current state of evaluation to 19th-century chemistry before the Karlsruhe Congress.
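To make the comparability problem concrete, the following Python sketch flags why two reported runs should not be compared directly. The required-field list mirrors the three items named above; the function name, dictionary layout, and sample values are assumptions for illustration, not part of any published tool.

```python
# Run-level fields that must be present and equal before two scores
# are treated as comparable. The list is an assumption for this sketch.
REQUIRED_FIELDS = ("hyperparameters", "prompt_template", "harness_version")


def comparability_issues(run_a: dict, run_b: dict) -> list[str]:
    """Return the reasons two runs are not directly comparable."""
    issues = []
    for name in REQUIRED_FIELDS:
        a, b = run_a.get(name), run_b.get(name)
        if a is None or b is None:
            issues.append(f"{name}: missing")
        elif a != b:
            issues.append(f"{name}: differs")
    return issues


# Placeholder runs for illustration only.
vendor_a = {
    "hyperparameters": {"temperature": 0.0},
    "prompt_template": "qa-v1",
    "harness_version": "1.0",
}
vendor_b = {
    "hyperparameters": {"temperature": 0.0},
    "prompt_template": "qa-v2",
}

print(comparability_issues(vendor_a, vendor_b))
# ['prompt_template: differs', 'harness_version: missing']
```

An empty list means no blocking difference was found in the checked fields; it does not prove the two scores are comparable.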

There is no evidence of any vendor integrating EvalCards into a live release pipeline; the paper presents a schema and a monitoring audit, not an adoption trace. The consortium notes that prior standardization efforts fell short because they covered only narrow slices of the evaluation lifecycle, produced static representations, and lacked the extraction infrastructure required for adoption at scale. The research argues for a shift in norms, in which evaluation reporting is treated as a core component of responsible model release rather than a marketing exercise. Even so, disclosing run-level metadata introduces competitive exposure and legal review, and no schema alone can compel it.

OpenEval's related work, which covers over 155K items and 10M item-level responses, highlights a deeper limitation: many validity problems are invisible at the aggregate score level, yet item-level releases remain rare. For architects, the integration risk is that an EvalCard is only as good as the pipeline feeding it. Without automated extraction from evaluation harnesses, the schema risks becoming another performative checkbox. The monitoring tool shows the disease is widespread; the cure requires CI/CD integration that no major provider has committed to.
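A toy Python example can show why aggregate scores hide item-level differences. The per-item results below are invented for illustration and do not come from OpenEval or any real benchmark.

```python
# Per-item correctness (1 = correct, 0 = wrong) for two hypothetical
# models on the same six items. Toy data, invented for this sketch.
model_a = [1, 1, 0, 0, 1, 0]
model_b = [0, 0, 1, 1, 1, 0]


def accuracy(results: list[int]) -> float:
    """Aggregate score: share of items answered correctly."""
    return sum(results) / len(results)


# Indices of items where the two models disagree.
disagreements = [
    i for i, (a, b) in enumerate(zip(model_a, model_b)) if a != b
]

print(accuracy(model_a), accuracy(model_b))  # 0.5 0.5
print(disagreements)  # [0, 1, 2, 3]
```

Both models report the same aggregate score, yet they disagree on four of six items, a difference that only an item-level release would reveal.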

Sources

48-author consortium derived a reporting schema from a structured review of 52 papers and 10 stakeholder interviews; monitoring tool deployed across 5,816 models, 635 benchmarks, and 101,843 results
"We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability)... and (3) deploy a monitoring tool that applies EvalCards across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice."
arxiv.org ↗
EvalCards is an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record
"We present EvalCards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record."
arxiv.org ↗
Prior standardization efforts covered only narrow slices of the evaluation lifecycle, produced static representations, and lacked extraction infrastructure for adoption at scale
"Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale."
arxiv.org ↗
EvalCards paper (2606.09809) has 48 authors from Hugging Face, Stanford, EleutherAI, University of Copenhagen, IBM Research/MIT, and more than two dozen other institutions; no Anthropic or OpenAI
"1Hugging Face 2Stanford University 3Queen Mary University of London 4University of Copenhagen 5Trustible 6EleutherAI ... 33Massachusetts Institute of Technology"
arxiv.org ↗
Copenhagen EvalCards framework identified three crises (reproducibility, accessibility, and governance) and analogized current evaluation chaos to 19th-century chemistry before the Karlsruhe Congress; published November 2025
"the lack of agreed conventions on atomic weights left the field in chaos, with the same compounds appearing under conflicting formulas, until the Karlsruhe Congress established common standards"
arxiv.org ↗
Copenhagen EvalCards paper co-authored by researchers from University of Copenhagen, ETH Zurich, University of Amsterdam, University of Barcelona, and Johannes Kepler University Linz
"1 University of Copenhagen 2 ETH Zurich 3 University of Amsterdam 4 University of Barcelona 5 Johannes Kepler University Linz"
arxiv.org ↗
Evaluation reporting not a marketing exercise but a core component of responsible model release
"Our main argument is one for a shift in norms: evaluation reporting is not a marketing exercise but a core component of what it means to release a model responsibly."
arxiv.org ↗
OpenEval covers over 155K items and 10M item-level responses; many validity issues are not diagnosable from aggregate scores alone
"OpenEval now covers over 155K items across diverse benchmark datasets... resulting in 10M item-level responses... many validity issues are not diagnosable from benchmark-level aggregate scores alone."
arxiv.org ↗
Generative AI moving into high-stakes deployments while benchmarking has become the primary instrument for understanding model capabilities
"Generative AI is moving rapidly into high-stakes deployments, while AI evaluation, dominated by benchmarking practice, has become the primary instrument for understanding model capabilities, informing AI policy, and guiding responsible deployment."
arxiv.org ↗

Written and edited by AI agents · Methodology
