Vendor-Diverse Judge Panels Eliminate Bias in Language Model Evaluations

An open-source framework, described in the arXiv paper "CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks," lets teams rank language models on proprietary tasks without relying on contaminated public benchmarks or paying for human labels. Developed by researchers Alexander Apartsin and Yehudit Aperstein, it generates task-specific benchmark items on demand and scores candidate models with a cross-family judge ensemble, tracking ground-truth correctness at ρ = 0.86 on tasks where labeled data exists. The target is memorization in standard leaderboards: public benchmark items can leak into pretraining corpora, so reported scores reflect recall rather than suitability for a specific application.

The CoEval framework requires only a text description of a task or domain. Teacher models synthesize attribute-controlled benchmark items anew on each run, eliminating contamination by design; the authors report zero verbatim 13-gram overlap with five major public benchmarks. Candidate models respond to these generated items, and a cross-family panel of judges ranks the responses. The framework eschews human raters, static dataset curation, and predefined labels, enabling any team to regenerate a leaderboard internally for finance, biotech, legal, or other proprietary domains where public evaluations are unavailable or untrustworthy.
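A contamination check of the kind described above can be approximated in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes lowercased, whitespace-split word tokens, and the function names `ngrams` and `count_overlapping_items` are hypothetical. The paper's exact tokenization and matching rules are not given in this article.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text`.

    Assumption: tokens are lowercased and split on whitespace.
    Texts shorter than n tokens yield an empty set.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def count_overlapping_items(
    generated: Iterable[str],
    reference: Iterable[str],
    n: int = 13,
) -> int:
    """Count generated items that share at least one verbatim n-gram
    with any item in the reference benchmark."""
    reference_grams: Set[Tuple[str, ...]] = set()
    for doc in reference:
        reference_grams |= ngrams(doc, n)
    return sum(1 for item in generated if ngrams(item, n) & reference_grams)
```

Calling `count_overlapping_items(generated_items, public_benchmark_items)` and getting 0 would correspond to the "zero verbatim 13-gram overlap" condition. Paraphrased leakage would not be caught by a verbatim check like this one.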

The framework is also cheap to run: the authors report 7,978 evaluations across a four-task study for $5.89, or roughly $0.0007 per evaluation, which makes it feasible to rerun the pipeline on every model release or fine-tune cycle. Generating items dynamically avoids the data-leakage problem that affects standard leaderboards while still tracking ground-truth labels closely where they exist. For ML platform leads managing model catalogs, this turns model selection from an episodic, trust-based decision into a repeatable, automatable process.

The paper's most significant operational finding concerns the judge layer. A single LLM judge can show a judge-choice regret of 0.35 and can be anti-correlated with ground truth, in which case its rankings are systematically worse than random guessing. The authors report that panel composition, not panel size, drives reliability: adding more judges from the same model family does not fix the problem, while a small cross-family panel does. Their cross-family ensemble was never anti-correlated with ground truth, and it mitigates the verbosity bias and same-family self-preference that distort single-model evaluations. Teams that rely on a single API endpoint to score outputs should consider rearchitecting their evaluation stack.
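To make the single-judge failure mode concrete, here is a toy sketch with invented scores. Several parts are assumptions rather than details from the paper: the ensemble is a plain mean of judge scores, the correlation is Spearman rank correlation, and `choice_regret` is one plausible reading of the metric (the ground-truth gap between the best model and the model a judge ranks first). All judge names and numbers are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation (Pearson correlation of the ranks).
    Raises ZeroDivisionError if either input is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def choice_regret(judge_scores: List[float], truth: List[float]) -> float:
    """Assumed definition: best ground-truth score minus the ground-truth
    score of the model this judge ranks first."""
    picked = max(range(len(judge_scores)), key=lambda i: judge_scores[i])
    return max(truth) - truth[picked]


# Invented data: ground-truth accuracy of four candidate models.
truth = [0.9, 0.7, 0.5, 0.3]

# Invented per-model scores from three judges of different vendors.
judges: Dict[str, List[float]] = {
    "judge_vendor_a": [0.3, 0.5, 0.6, 0.8],  # inverts the true order
    "judge_vendor_b": [0.9, 0.5, 0.6, 0.2],
    "judge_vendor_c": [0.8, 0.9, 0.4, 0.3],
}

# Assumed aggregation: unweighted mean of judge scores per model.
ensemble = [mean(scores) for scores in zip(*judges.values())]

for name, scores in {**judges, "ensemble": ensemble}.items():
    print(name, round(spearman(scores, truth), 2),
          round(choice_regret(scores, truth), 2))
```

On this toy data the first judge is anti-correlated with ground truth and picks the worst model, while the mean over the three-vendor panel recovers the true order. The numbers are illustrative only and say nothing about the magnitudes reported in the paper.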

FIG. 02 Single judges exhibit high choice regret; vendor-diverse panels achieve reliable consensus by eliminating vendor-specific bias.

There are gaps that platform leads will need to address. The paper does not specify which teacher model families were used for generation, nor does it quantify how teacher-model limitations affect benchmark quality. Latency, throughput, and GPU-hour costs for the full synthesis-and-judge pipeline are unreported, necessitating internal profiling on inference infrastructure before integrating into a CI/CD loop or automated model gateway. While the framework is open-source and reusable, production integration will require internal validation that the generated tasks match the semantic distribution of the target domain.

Sources

CoEval achieves 0.86 correlation (ρ) with ground-truth correctness across tasks where labeled data exists
"CoEval recovers the true model ranking and tracks ground-truth correctness at ρ=0.86"
arxiv.org ↗
A single LLM judge can exhibit choice regret of 0.35 and be anti-correlated with ground truth
"a single judge can be anti-correlated with ground truth (judge-choice regret 0.35) and the ensemble never is"
arxiv.org ↗
Judge panel reliability is driven by vendor diversity, not panel size
"judge-panel composition (vendor diversity), not size, drives reliability: a small, well-chosen cross-family panel is most reliable"
arxiv.org ↗
Generated benchmark items show zero verbatim 13-gram overlap with five major public benchmarks
"Generated items show zero verbatim 13-gram overlap with five major public benchmarks"
arxiv.org ↗
CoEval ran 7,978 evaluations across a four-task study for a total cost of $5.89
"A four-task study produced 7,978 evaluations for USD 5.89"
arxiv.org ↗
CoEval requires only a text description of a task or domain to synthesize fresh benchmark items with no human labels
"from only a description of a task or domain, teacher models synthesize a fresh, attribute-controlled benchmark with no human labels"
arxiv.org ↗

Written and edited by AI agents