On May 11, 2026, researchers from Google Cloud AI Research and the University of Illinois Urbana-Champaign released RubricEM, a reinforcement learning framework that trains deep research agents on open-ended outputs without ground-truth answers. The framework addresses a core blocker: RL training works well for math and code, where exact-match verification is possible, but fails for long-form synthesis tasks such as research reports, where there is no single right answer.
RubricEM stages training into four explicit steps: planning, research, review, and answer synthesis. At the start of each attempt, the agent generates a task-specific rubric, which then guides every part of the trajectory: search decisions, synthesis, and the judge's feedback signal. This converts one long, hard-to-credit rollout into a sequence of smaller rubric-conditioned decisions.
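As a rough sketch of that staged loop, assuming hypothetical `generate_rubric` and `act` interfaces on the agent (the stage names come from the paper; everything else here is illustrative, not RubricEM's actual code):

```python
from dataclasses import dataclass, field

STAGES = ["planning", "research", "review", "answer_synthesis"]

@dataclass
class Trajectory:
    rubric: str                                  # task-specific rubric generated up front
    stage_outputs: dict = field(default_factory=dict)

def run_attempt(agent, task):
    # 1. Generate a task-specific rubric before taking any other action.
    rubric = agent.generate_rubric(task)
    traj = Trajectory(rubric=rubric)
    # 2. Every stage is conditioned on the same rubric, so search decisions,
    #    synthesis, and later judging all share one explicit target.
    for stage in STAGES:
        traj.stage_outputs[stage] = agent.act(
            task, stage, rubric, history=traj.stage_outputs
        )
    return traj
```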
Credit assignment uses Stage-Structured GRPO (SS-GRPO), a variant of Group Relative Policy Optimization. Instead of a single terminal score, each of the four stages receives its own rubric judgment, giving the optimizer a denser, finer-grained signal while keeping the method critic-free.
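A minimal sketch of what per-stage credit assignment could look like, assuming SS-GRPO applies the standard GRPO group normalization independently to each stage's rubric score (the paper's exact formula is not quoted here):

```python
import numpy as np

def ss_grpo_advantages(stage_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages computed per stage.

    stage_rewards: shape (group_size, num_stages); each column holds the
    judge's rubric score for one stage across all rollouts in the group.
    Standard GRPO normalizes a single terminal reward across the group;
    here each stage is normalized against its own group statistics,
    yielding a denser signal with no learned critic.
    """
    mean = stage_rewards.mean(axis=0, keepdims=True)
    std = stage_rewards.std(axis=0, keepdims=True)
    return (stage_rewards - mean) / (std + eps)

# Example: 4 rollouts x 4 stages (planning, research, review, synthesis).
rewards = np.array([[0.6, 0.4, 0.7, 0.5],
                    [0.8, 0.5, 0.6, 0.7],
                    [0.5, 0.9, 0.8, 0.6],
                    [0.7, 0.6, 0.5, 0.8]])
print(ss_grpo_advantages(rewards))
```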
The second major component is a reflection meta-policy. Built on the same backbone model, it ingests scored trajectories and distills them into explicit, text-based lessons. Unlike standard post-training, which locks insights into weights alone, the reflection meta-policy surfaces reusable, rubric-grounded guidance for future attempts.
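One way the reflection step could be wired up; the `Lesson` structure and `meta_policy.summarize` interface below are assumptions for illustration, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class Lesson:
    rubric_criterion: str   # which rubric item the lesson is grounded in
    guidance: str           # explicit, text-based advice for future attempts

def reflect(meta_policy, scored_trajectories, lesson_buffer):
    """Distill scored rollouts into reusable text lessons.

    Unlike weight updates alone, the lessons live in a buffer the agent
    can read on future attempts. `meta_policy.summarize` stands in for
    the same-backbone model prompted to reflect on each trajectory.
    """
    for traj, stage_scores in scored_trajectories:
        for criterion, advice in meta_policy.summarize(traj, stage_scores):
            lesson_buffer.append(Lesson(criterion, advice))
    return lesson_buffer
```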
RubricEM-8B outperforms comparable open-weight research models and approaches the performance of proprietary systems such as Gemini Deep Research and OpenAI's deep research product across four long-form research benchmarks.
For enterprise AI architects, RubricEM identifies three adoption points. First, the rubric-as-scaffold pattern applies to any agentic workflow with non-binary quality criteria—contract analysis, regulatory summarization, technical due diligence. Second, SS-GRPO's stage-specific rewards integrate into existing GRPO training setups without adding a learned critic. Third, the reflection meta-policy accumulates structured experience across training runs rather than losing insights to model parameters.
The paper includes ablation analyses isolating which components drive the gains. Open questions remain: the four benchmarks are not named in the publicly available portions of the paper, and rubrics that are poorly specified early in training could propagate errors downstream through the evolving rubric buffer.