A new benchmark called RMGAP tested 24 state-of-the-art reward models against real-world preference diversity and found that the best performer achieved just 49.27% Best-of-N accuracy, selecting the preference-aligned response less than half the time and exposing a fundamental generalization problem at the heart of production RLHF pipelines.
The paper, by Yangyang Zhou and Yi-Chen Li, published May 3, 2026, targets a gap in existing reward model evaluation: every major benchmark assumes a single, universal preference ordering. RMGAP rejects that premise. Real users want different things: different tones, reasoning styles, verbosity levels, and safety trade-offs. A reward model that cannot navigate that variance will systematically misalign fine-tuned models in deployment.
The benchmark comprises 1,097 instances spanning four domains: Chat, Writing, Reasoning, and Safety. For each prompt, the researchers generated four distinct responses with deliberately varied linguistic profiles. The original prompts were then rewritten to make one response the uniquely correct choice given a specific stated preference—forcing the reward model to correctly identify contextual fit rather than latching onto surface-level quality signals. Each prompt was further extended with two paraphrased variants, testing whether models respond to semantic content or surface phrasing.
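Based on that description, a single RMGAP instance plausibly reduces to a structure like the one sketched below. This is an illustrative reconstruction, not the released schema: the field names and the `rm_score` callable are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RMGAPInstance:
    """One benchmark instance, reconstructed from the paper's description.

    Field names are illustrative guesses, not the released schema.
    """
    domain: str             # one of: Chat, Writing, Reasoning, Safety
    prompt: str             # rewritten so it states an explicit preference
    paraphrases: list[str]  # two semantically equivalent rewordings
    responses: list[str]    # four responses with varied linguistic profiles
    preferred_index: int    # the uniquely correct response under that preference

def consistent_pick(rm_score, instance: RMGAPInstance) -> bool:
    """True if the reward model ranks the preference-aligned response
    highest for the original prompt and for both paraphrases."""
    for p in [instance.prompt, *instance.paraphrases]:
        scores = [rm_score(p, r) for r in instance.responses]
        if scores.index(max(scores)) != instance.preferred_index:
            return False
    return True
```

Scoring consistency across the paraphrases is what separates models that read the stated preference from models that latch onto surface phrasing.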
Most organizations running RLHF pipelines use a single reward model trained on aggregate preference data—often from a narrow annotator pool. RMGAP's results suggest those models optimize for a statistical average that poorly represents any actual user subgroup. A 49.27% ceiling on Best-of-N accuracy means the best available RM, given multiple response candidates, picks the preference-aligned response less than half the time. For production systems where Best-of-N sampling is a common inference-time alignment strategy, this failure mode translates directly into degraded output quality for users whose preferences deviate from the training distribution.
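For readers unfamiliar with the mechanism, a generic Best-of-N loop looks roughly like this. The `generate` and `rm_score` callables are hypothetical stand-ins for a policy model and a reward model, not RMGAP code.

```python
def best_of_n(prompt: str, generate, rm_score, n: int = 4) -> str:
    """Generic Best-of-N sampling: draw n candidates from the policy and
    return the one the reward model scores highest. If the RM identifies
    the preference-aligned candidate only ~49% of the time, over half of
    the 'best' outputs it returns fail to match the stated preference."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [rm_score(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```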
The benchmark raises a concrete architectural question: should reward models be personalized or segmented rather than monolithic? The RMGAP framing implies a single RM cannot adequately serve a heterogeneous user base. Production alignment stacks may need per-persona or preference-conditioned reward signals. That adds infrastructure cost and requires richer user-preference data pipelines, but the alternative—deploying a reward model that generalizes poorly—is an alignment strategy that breaks silently rather than loudly.
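One lightweight form of preference conditioning, short of training per-persona reward models, is to inject an explicit preference description into the RM's input at scoring time. The sketch below is a hypothetical illustration of that idea; whether any given RM actually respects such conditioning is precisely what an RMGAP-style evaluation would measure.

```python
def preference_conditioned_score(rm_score, preference: str,
                                 prompt: str, response: str) -> float:
    """Score a response under an explicit user preference by prepending
    the preference to the prompt. `rm_score` is a hypothetical
    reward-model interface; the tag format is an arbitrary choice."""
    conditioned = f"[User preference: {preference}]\n{prompt}"
    return rm_score(conditioned, response)
```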
RMGAP evaluates 24 models but does not disclose which specific models were benchmarked, making it difficult to assess whether any particular commercial or open-weight RM sits near that 49.27% ceiling or well below it. The benchmark is also limited to text-only, single-turn interactions across four domains; multimodal models and agentic multi-turn tasks are out of scope. The dataset and code are publicly available at github.com/nanzhi84/RMGAP.
The standard evaluation stack was never designed to catch this class of failure. RMGAP is the first benchmark built specifically to surface it, and the results suggest the gap has been there all along.