A new benchmark called RMGAP tested 24 state-of-the-art reward models against real-world preference diversity and found that the best performer achieved just 49.27% Best-of-N accuracy, selecting the preference-aligned response less than half the time and exposing a fundamental generalization problem at the heart of production RLHF pipelines.
The paper, by Yangyang Zhou and Yi-Chen Li, published May 3, 2026, targets a gap in existing reward model evaluation: every major benchmark assumes a single, universal preference ordering. RMGAP rejects that premise. Real users want different things: different tones, reasoning styles, verbosity levels, and safety trade-offs. A reward model that cannot navigate that variance will systematically misalign fine-tuned models in deployment.
The benchmark comprises 1,097 instances spanning four domains: Chat, Writing, Reasoning, and Safety. For each prompt, the researchers generated four distinct responses with deliberately varied linguistic profiles. The original prompts were then rewritten to make one response the uniquely correct choice given a specific stated preference—forcing the reward model to correctly identify contextual fit rather than latching onto surface-level quality signals. Each prompt was further extended with two paraphrased variants, testing whether models respond to semantic content or surface phrasing.
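Based on that description, a single RMGAP instance plausibly reduces to a structure like the one sketched below. This is an illustrative reconstruction, not the released schema: the field names and the `rm_score` callable are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RMGAPInstance:
    """One benchmark instance, reconstructed from the paper's description.

    Field names are illustrative guesses, not the released schema.
    """
    domain: str             # one of: Chat, Writing, Reasoning, Safety
    prompt: str             # rewritten so it states an explicit preference
    paraphrases: list[str]  # two semantically equivalent rewordings
    responses: list[str]    # four responses with varied linguistic profiles
    preferred_index: int    # the uniquely correct response under that preference

def consistent_pick(rm_score, instance: RMGAPInstance) -> bool:
    """True if the reward model ranks the preference-aligned response
    highest for the original prompt and for both paraphrases."""
    for p in [instance.prompt, *instance.paraphrases]:
        scores = [rm_score(p, r) for r in instance.responses]
        if scores.index(max(scores)) != instance.preferred_index:
            return False
    return True
```

Scoring consistency across the paraphrases is what separates models that read the stated preference from models that latch onto surface phrasing.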
Most organizations running RLHF pipelines use a single reward model trained on aggregate preference data—often from a narrow annotator pool. RMGAP's results suggest those models optimize for a statistical average that poorly represents any actual user subgroup. A 49.27% ceiling on Best-of-N accuracy means the best available RM, given multiple response candidates, picks the preference-aligned response less than half the time. For production systems where Best-of-N sampling is a common inference-time alignment strategy, this failure mode translates directly into degraded output quality for users whose preferences deviate from the training distribution.
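For readers unfamiliar with the mechanism, a generic Best-of-N loop looks roughly like this. The `generate` and `rm_score` callables are hypothetical stand-ins for a policy model and a reward model, not RMGAP code.

```python
def best_of_n(prompt: str, generate, rm_score, n: int = 4) -> str:
    """Generic Best-of-N sampling: draw n candidates from the policy and
    return the one the reward model scores highest. If the RM identifies
    the preference-aligned candidate only ~49% of the time, over half of
    the 'best' outputs it returns fail to match the stated preference."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [rm_score(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```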
The benchmark raises a concrete architectural question: should reward models be personalized or segmented rather than monolithic? The RMGAP framing implies a single RM cannot adequately serve a heterogeneous user base. Production alignment stacks may need per-persona or preference-conditioned reward signals. That adds infrastructure cost and requires richer user-preference data pipelines, but the alternative—deploying a reward model that generalizes poorly—is an alignment strategy that breaks silently rather than loudly.
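One lightweight form of preference conditioning, short of training per-persona reward models, is to inject an explicit preference description into the RM's input at scoring time. The sketch below is a hypothetical illustration of that idea; whether any given RM actually respects such conditioning is precisely what an RMGAP-style evaluation would measure.

```python
def preference_conditioned_score(rm_score, preference: str,
                                 prompt: str, response: str) -> float:
    """Score a response under an explicit user preference by prepending
    the preference to the prompt. `rm_score` is a hypothetical
    reward-model interface; the tag format is an arbitrary choice."""
    conditioned = f"[User preference: {preference}]\n{prompt}"
    return rm_score(conditioned, response)
```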
RMGAP evaluates 24 models but does not disclose which specific models were benchmarked, making it difficult to assess whether any particular commercial or open-weight RM sits near that 49.27% ceiling or well below it. The benchmark is also limited to text-only, single-turn interactions across four domains; multimodal models and agentic multi-turn tasks are out of scope. The dataset and code are publicly available at github.com/nanzhi84/RMGAP.
The standard evaluation stack was never designed to catch this class of failure. RMGAP is the first benchmark built specifically to surface it, and the results suggest the gap has been there all along.