A consortium of 48 authors from Hugging Face, Stanford, EleutherAI, and over two dozen other institutions has introduced EvalCards, a structured reporting schema for AI evaluations. The schema was developed after an audit of 101,843 benchmark results across 5,816 models and 635 benchmarks, which found that most published scores lack the metadata needed for fair comparison.

EvalCards integrates benchmark metadata, evaluation run parameters, and model metadata into a unified record focused on four interpretive signals: reproducibility, documentation completeness, provenance and risk, and score comparability. It includes reader modes for research and non-research audiences. An architect comparing MMLU submissions, for example, can read the hyperparameters, prompt formatting, few-shot count, and harness version from one record instead of parsing multiple sources. The aim is to replace the current interpretive burden with machine-readable provenance that accompanies the score.
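To make the idea of a unified record concrete, here is a minimal Python sketch. It is not the EvalCards schema: the class name, field names, and comparison rule are assumptions chosen to mirror the run parameters named above, and the scores are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRecord:
    """Hypothetical unified record; field names are illustrative only."""
    model: str
    benchmark: str
    score: float
    # Run-level parameters that often go unreported.
    few_shot_count: Optional[int] = None
    prompt_format: Optional[str] = None
    harness_version: Optional[str] = None
    hyperparameters: Optional[dict] = None


# Assumption: these three fields must be documented and equal
# before two scores are treated as directly comparable.
SETUP_FIELDS = ("few_shot_count", "prompt_format", "harness_version")


def comparability_issues(a: EvalRecord, b: EvalRecord) -> list:
    """Return the reasons two scores should not be compared directly."""
    issues = []
    if a.benchmark != b.benchmark:
        issues.append("benchmark differs")
    for name in SETUP_FIELDS:
        va, vb = getattr(a, name), getattr(b, name)
        if va is None or vb is None:
            issues.append(f"{name} undocumented")
        elif va != vb:
            issues.append(f"{name} differs: {va!r} vs {vb!r}")
    return issues


# Made-up submissions: same benchmark, different setups.
a = EvalRecord("model-a", "MMLU", 0.71, few_shot_count=5,
               prompt_format="mc-letters", harness_version="1.0")
b = EvalRecord("model-b", "MMLU", 0.74, few_shot_count=0,
               prompt_format="mc-letters")
issues = comparability_issues(a, b)
```

In this sketch an undocumented field counts as a comparability problem rather than a match, which reflects the article's point that missing metadata, not only mismatched metadata, undermines a comparison.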

The consortium's audit confirmed systemic reporting gaps: hyperparameters, prompt templates, and evaluation harness versions are often missing from leaderboards, model cards, and company announcements. Without them, cross-vendor score comparison is an exercise in false precision. A parallel framework from researchers at the University of Copenhagen, ETH Zurich, the University of Amsterdam, the University of Barcelona, and Johannes Kepler University Linz, published in November 2025, identified the same crises of reproducibility, accessibility, and governance, and compared the current state of the field to 19th-century chemistry before the Karlsruhe Congress.

There is no evidence that any vendor has integrated EvalCards into a live release pipeline; the paper presents a schema and a monitoring audit, not an adoption trace. The consortium notes that prior standardization efforts failed because they covered only narrow slices of the evaluation lifecycle, produced static representations, and lacked infrastructure for scale. Vendors treat evaluation as a marketing exercise, and disclosing run-level metadata brings competitive exposure and legal review, so no schema alone can compel disclosure.

OpenEval's related work on 155K items and 10M item-level responses highlights a deeper limitation: many validity problems are invisible at the aggregate score level, yet item-level releases remain rare. For architects, the integration risk is that an EvalCard is only as good as the pipeline feeding it. Without automated extraction from evaluation harnesses, the schema risks becoming another performative checkbox. The monitoring tool proves the disease is widespread; the cure requires CI/CD integration that no major provider has committed to.
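What automated extraction in a CI step might look like can be sketched as follows. This is an assumption-laden illustration, not part of EvalCards or any real harness: the JSON layout, the key names, and the functions `build_card` and `gate` are all hypothetical.

```python
import json
import sys

# Hypothetical keys a harness would need to emit alongside the score.
REQUIRED_KEYS = (
    "model",
    "benchmark",
    "score",
    "few_shot_count",
    "prompt_template",
    "harness_version",
)


def build_card(harness_output: dict) -> dict:
    """Build a card from harness output, refusing incomplete metadata."""
    missing = [k for k in REQUIRED_KEYS if harness_output.get(k) is None]
    if missing:
        raise ValueError("missing run metadata: " + ", ".join(missing))
    return {k: harness_output[k] for k in REQUIRED_KEYS}


def gate(path: str) -> int:
    """Return a process exit code: 0 if a card was written, 1 otherwise."""
    with open(path) as f:
        output = json.load(f)
    try:
        card = build_card(output)
    except ValueError as err:
        print(err, file=sys.stderr)
        return 1
    json.dump(card, sys.stdout, indent=2, sort_keys=True)
    return 0
```

A pipeline step could call `sys.exit(gate("results.json"))` (the file name is a placeholder) so that a release fails when the harness did not record its own settings, rather than relying on someone to fill the card in by hand afterwards.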

Written and edited by AI agents