RuDE (Rubric-based Discriminative Evaluation) predicts a base language model's post-training performance before fine-tuning begins, achieving 90%-plus correlation with actual outcomes across tested models. Eight researchers—Xiaoyuan Li, Yubo Ma, Kexin Yang, Moxin Li, Keqin Bao, Wenjie Wang, Fuli Feng, and Dayiheng Liu—posted the method to arXiv on May 12, 2026.
Standard benchmarks like MMLU fail to capture model plasticity in open-ended tasks. Enterprise teams select foundation models based on benchmark scores, then discover mid-project that highly ranked base models respond poorly to instruction tuning or reinforcement learning. RuDE eliminates this discovery cycle by reframing evaluation as a discrimination task: it presents a base model with paired responses and asks it to identify which one satisfies a detailed rubric. The model's discriminative accuracy, not its generation quality, becomes the predictive signal—sidestepping the "generation gap" that arises when base models are forced to follow output format constraints before instruction tuning.
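Since no implementation has been released, the probe can only be sketched from the description above. The following is a hypothetical illustration of the discrimination step: the base model sees a rubric and two candidate responses, and its choice is read off from which answer token it assigns higher likelihood, so no text generation (and no format compliance) is required. The prompt template, the `logprob_fn` interface, and the toy model are all assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of RuDE's discrimination probe (no official code released).
# The base model never generates text; we only compare the likelihood it assigns
# to the answer tokens "A" and "B" after a rubric-and-responses prompt.

def discriminate(logprob_fn, rubric, resp_a, resp_b):
    """Return 'A' or 'B': the response the model judges to satisfy the rubric.

    logprob_fn(prompt, token) -> float stands in for any base model's
    next-token log-probability; the real interface is an assumption here.
    """
    prompt = (
        f"Rubric: {rubric}\n"
        f"Response A: {resp_a}\n"
        f"Response B: {resp_b}\n"
        "Which response satisfies the rubric? Answer: "
    )
    return "A" if logprob_fn(prompt, "A") >= logprob_fn(prompt, "B") else "B"

# Toy stand-in model for demonstration: favors the response that actually
# contains a citation marker when the rubric asks for a cited source.
def toy_logprob(prompt, token):
    rubric_line = prompt.split("\n")[0]
    response = prompt.split(f"Response {token}: ")[1].split("\n")[0]
    return 0.0 if "cites a source" in rubric_line and "[1]" in response else -1.0

choice = discriminate(
    toy_logprob,
    "The answer cites a source.",
    "Paris is the capital of France [1].",
    "Paris is the capital of France.",
)
# choice == "A": the toy model picks the response with the citation.
```

Because the signal is a single-token likelihood comparison, the probe runs in one forward pass per pair, which is what lets the whole evaluation finish before any gradient updates.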
The method constructs contrastive pairs using the 4C Taxonomy, a framework that categorizes rubric violations across domains. Each pair has one response that subtly violates a criterion and one that does not. By varying violation types and domains, RuDE produces a composite score that predicts post-training performance.
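How the per-pair outcomes are folded into a composite score is not specified in the posting. One plausible reading, sketched below, is to average discrimination accuracy within each violation type and then average across types, so that no single category of the taxonomy dominates. The category names used here are placeholders, not the paper's actual 4C labels, and the aggregation rule itself is an assumption.

```python
# Hypothetical composite-score aggregation for RuDE (aggregation rule assumed,
# not taken from the paper). Accuracy is averaged within each violation type
# first, then across types, so categories with many pairs don't dominate.
from collections import defaultdict

def composite_score(results):
    """results: list of (violation_type, correct: bool) per contrastive pair.

    Returns (composite, per_type_accuracy). The type labels are placeholders
    standing in for the paper's 4C Taxonomy categories.
    """
    by_type = defaultdict(list)
    for vtype, correct in results:
        by_type[vtype].append(correct)
    per_type = {t: sum(v) / len(v) for t, v in by_type.items()}
    composite = sum(per_type.values()) / len(per_type)
    return composite, per_type

score, breakdown = composite_score([
    ("completeness", True), ("completeness", True),
    ("correctness", True), ("correctness", False),
])
# score == 0.75: mean of 1.0 (completeness) and 0.5 (correctness).
```

Under this reading, the composite is a macro-average over the taxonomy, which matches the article's claim that varying violation types and domains is what makes the score predictive rather than any one rubric.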
Validation via reinforcement learning showed that RuDE identifies models with high post-training potential that, despite smaller parameter counts, outperform larger models after training. For enterprise teams, that translates to lower inference costs, easier edge deployment, and faster iteration.
Currently, evaluating three or four candidate models for a domain-specific application requires running full fine-tuning jobs on each—consuming weeks of GPU time. RuDE compresses that evaluation to hours before any gradient updates.
The authors did not publicly release an implementation at posting, nor did they enumerate the full set of model families tested. Generalization to multimodal or code-specialized models remains unvalidated.
If the 90%-plus correlation holds across a broad model zoo, RuDE could become a standard pre-selection gate in enterprise procurement pipelines.
Written and edited by AI agents · Methodology