An open-source framework, detailed in the arXiv paper "CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks," allows teams to rank language models for proprietary tasks without relying on contaminated public benchmarks or purchasing human labels. The framework, developed by researchers Alexander Apartsin and Yehudit Aperstein, generates task-specific benchmark items on demand and scores candidate models using a cross-family judge ensemble, achieving a 0.86 correlation with ground-truth correctness in domains with labeled data. This addresses the issue of memorization in standard leaderboards, where public benchmark items can leak into pretraining corpora, causing reported scores to reflect recall rather than suitability for a specific application.

The CoEval framework requires only a text description of a task or domain. Teacher models synthesize attribute-controlled benchmark items anew on each run, eliminating contamination by design; the authors report zero verbatim 13-gram overlap with five major public benchmarks. Candidate models respond to these generated items, and a cross-family panel of judges ranks the responses. The framework eschews human raters, static dataset curation, and predefined labels, enabling any team to regenerate a leaderboard internally for finance, biotech, legal, or other proprietary domains where public evaluations are unavailable or untrustworthy.
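The zero-overlap claim can be spot-checked internally against any reference corpus a team holds. The following is a minimal sketch, not the authors' procedure: it assumes lowercase whitespace tokenization (the tokenization used in the paper is not stated here), and the function names `ngrams` and `verbatim_overlap` are hypothetical.

```python
from __future__ import annotations


def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of n-grams over lowercased whitespace tokens.

    Assumption: whitespace tokenization. Texts shorter than n tokens
    yield an empty set.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(generated: list[str], reference: list[str], n: int = 13) -> int:
    """Count generated items that share at least one n-gram with the reference set."""
    ref_grams: set[tuple[str, ...]] = set()
    for doc in reference:
        ref_grams |= ngrams(doc, n)
    return sum(1 for item in generated if ngrams(item, n) & ref_grams)


# Hypothetical data: one public benchmark item and two generated items.
public_items = [
    "the quick brown fox jumps over the lazy dog and then runs far away home",
]
generated_items = [
    "note: the quick brown fox jumps over the lazy dog and then runs far today",
    "summarize the quarterly filing in two sentences",
]

leaked = verbatim_overlap(generated_items, public_items)
print(f"{leaked} of {len(generated_items)} generated items overlap the reference set")
```

A check like this only detects verbatim reuse. Paraphrased or translated leakage would need a fuzzier comparison.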

The framework is also cheap to run: the authors report executing 7,978 evaluations for $5.89, which makes it feasible to rerun the pipeline on every model release or fine-tune cycle. Because items are generated fresh on each run, the pipeline avoids the data-leakage problem that affects standard leaderboards while still correlating closely with ground-truth labels where those exist. For ML platform leads managing model catalogs, this turns model selection from an episodic, trust-based decision into a repeatable, automatable process.
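To put the reported figures in per-unit terms, the arithmetic can be sketched as follows. Only the two constants come from the paper. The weekly rerun cadence is a hypothetical example, and the projection assumes per-evaluation cost stays constant, which may not hold for other models or longer prompts.

```python
TOTAL_COST_USD = 5.89   # reported by the authors
EVALUATIONS = 7_978     # reported by the authors

# Average cost of a single evaluation in the reported run.
cost_per_eval = TOTAL_COST_USD / EVALUATIONS


def projected_cost(evals_per_run: int, runs: int) -> float:
    """Linear extrapolation; assumes the per-evaluation cost does not change."""
    return cost_per_eval * evals_per_run * runs


print(f"per evaluation: ${cost_per_eval:.5f}")
# Hypothetical cadence: rerun the same-sized pipeline once a week for a year.
print(f"52 reruns: ${projected_cost(EVALUATIONS, 52):.2f}")
```

That works out to well under a tenth of a cent per evaluation, so the API bill is unlikely to be the constraint on rerun frequency.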

The paper's most significant operational finding pertains to the judge layer. A single LLM judge can exhibit a choice regret of 0.35 and can even be anti-correlated with ground truth, in which case its rankings are systematically worse than random guessing. The authors demonstrate that adding more judges from the same model family does not resolve the issue; vendor diversity is crucial. Their cross-family ensemble was never anti-correlated with ground truth, and it mitigates the verbosity bias and same-family self-preference that distort single-model evaluations. Teams relying on a single API endpoint for scoring outputs should consider rearchitecting their evaluation stack.

FIG. 02 Single judges exhibit high choice regret; vendor-diverse panels achieve reliable consensus by eliminating vendor-specific bias.
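The two judge-quality measures discussed above can be illustrated on toy data. This is a sketch under stated assumptions, not the paper's implementation: choice regret is assumed here to mean the true score of the best candidate minus the true score of the candidate the judge ranks first, correlation is computed as Spearman rank correlation, and the panel aggregates by averaging ranks. All model names and scores are hypothetical and constructed to show the effect.

```python
from statistics import mean


def ranks(scores: dict[str, float]) -> dict[str, float]:
    """Rank models by score (1 = lowest); tied models share the average rank."""
    ordered = sorted(scores, key=scores.get)
    out: dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[ordered[k]] = avg
        i = j + 1
    return out


def spearman(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rank correlation between two score tables over the same models."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(list(ra.values())), mean(list(rb.values()))
    cov = sum((ra[m] - ma) * (rb[m] - mb) for m in a)
    va = sum((ra[m] - ma) ** 2 for m in a)
    vb = sum((rb[m] - mb) ** 2 for m in a)
    if va == 0 or vb == 0:
        return 0.0  # a constant ranking carries no ordering information
    return cov / (va * vb) ** 0.5


def choice_regret(judge: dict[str, float], truth: dict[str, float]) -> float:
    """Assumed definition: best true score minus true score of the judge's top pick."""
    picked = max(judge, key=judge.get)
    return max(truth.values()) - truth[picked]


def ensemble(judges: list[dict[str, float]]) -> dict[str, float]:
    """Average each model's rank across judges, which ignores per-judge score scales."""
    per_judge = [ranks(j) for j in judges]
    return {m: mean([r[m] for r in per_judge]) for m in judges[0]}


# Hypothetical ground truth and three judges standing in for three vendor families.
truth = {"model_a": 0.90, "model_b": 0.70, "model_c": 0.55, "model_d": 0.40}
judge_x = {"model_a": 5.0, "model_b": 6.0, "model_c": 7.0, "model_d": 9.0}  # inverted
judge_y = {"model_a": 9.0, "model_b": 7.0, "model_c": 6.0, "model_d": 3.0}
judge_z = {"model_a": 8.0, "model_b": 8.5, "model_c": 5.0, "model_d": 4.0}

panel = ensemble([judge_x, judge_y, judge_z])
for name, scores in [("judge_x alone", judge_x), ("3-judge panel", panel)]:
    print(name, round(spearman(scores, truth), 2), round(choice_regret(scores, truth), 2))
```

In this toy setup the inverted judge is perfectly anti-correlated with the ground truth when used alone, while the rank-averaged panel ends up positively correlated even though it still contains that judge. Real judge errors are unlikely to be this clean, so the sketch shows the mechanics of the measurement rather than evidence for the finding.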

There are gaps that platform leads will need to address. The paper does not specify which teacher model families were used for generation, nor does it quantify how teacher-model limitations affect benchmark quality. Latency, throughput, and GPU-hour costs for the full synthesis-and-judge pipeline are unreported, necessitating internal profiling on inference infrastructure before integrating into a CI/CD loop or automated model gateway. While the framework is open-source and reusable, production integration will require internal validation that the generated tasks match the semantic distribution of the target domain.
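Since the paper reports no latency or throughput figures, a team would need to measure them on its own infrastructure. A minimal timing harness might look like the sketch below. The `profile_stage` helper and its `stage` callable are hypothetical stand-ins for whatever function wraps a generation, response, or judging call; nothing here reflects the framework's actual API.

```python
import time
from statistics import median
from typing import Any, Callable, Iterable


def profile_stage(stage: Callable[[Any], Any], inputs: Iterable[Any]) -> dict[str, float]:
    """Run `stage` once per input and report latency and throughput.

    Assumes at least one input; calls are made sequentially, so the
    throughput figure does not reflect batched or concurrent serving.
    """
    latencies: list[float] = []
    start = time.perf_counter()
    for item in inputs:
        t0 = time.perf_counter()
        stage(item)
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    return {
        "n": len(latencies),
        "p50_s": median(latencies),
        "max_s": max(latencies),
        "throughput_per_s": len(latencies) / wall if wall > 0 else float("inf"),
    }


# Hypothetical stand-in for a judge call; replace with a real client wrapper.
def fake_judge(prompt: str) -> int:
    return len(prompt)


print(profile_stage(fake_judge, ["item one", "item two", "item three"]))
```

Running the same harness over each stage separately (synthesis, candidate responses, judging) would show which one dominates before the pipeline is wired into a CI/CD gate.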

Written and edited by AI agents