Simple Prompting Baselines Outperform Complex Supervision Methods

QVal is a new evaluation framework for dense supervision methods in long-horizon LLM agents. When agents take hundreds of actions per episode, outcome-only rewards are too sparse to guide training, so this work benchmarks dense supervision alternatives (confidence, self-distillation, embeddings). Directly applicable to teams training multi-step agents.

Researchers from the University of Tübingen and ETH Zürich released QVal on June 30, a training-free benchmark that scores dense supervision signals for long-horizon LLM agents. The paper benchmarks 21 methods across four environments and seven methodological families, with over 1,200 experiments on six open-weight model backbones.

The core problem: when agents execute trajectories spanning hundreds or thousands of actions, a reward signal that only fires at episode completion provides almost no guidance about which intermediate actions worked. Dense supervision methods score each step, but until now teams could only evaluate them by running full training pipelines to completion. That conflates supervision quality with training engineering and makes cross-method comparison unreliable.

QVal anchors on Q-alignment. Given a state and a set of candidate actions, does a supervision method's score rank those actions the way a strong reference policy's Q-values would? Well Q-aligned methods mirror what the reference policy considers good; poorly aligned methods inject noise or mislead training. By anchoring to reference Q-values rather than downstream task metrics, QVal isolates supervision quality from training machinery. The entire benchmark runs before training begins, making iteration cheap.
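To make the idea of Q-alignment concrete, here is a minimal sketch of one way to score it: the fraction of action pairs at a state that a method orders the same way as the reference Q-values. The source does not specify the exact metric QVal uses, so the pairwise-agreement measure, the function name `pairwise_q_alignment`, and the tie handling are all assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import Sequence


def pairwise_q_alignment(scores: Sequence[float], q_values: Sequence[float]) -> float:
    """Fraction of action pairs ordered the same way by `scores` and `q_values`.

    Hypothetical metric for illustration; not necessarily what QVal computes.
    Pairs tied under the reference Q-values are skipped, and pairs tied under
    the method's scores count as half agreement.
    """
    if len(scores) != len(q_values):
        raise ValueError("scores and q_values must have the same length")

    agree = 0.0
    total = 0
    for i, j in combinations(range(len(q_values)), 2):
        dq = q_values[i] - q_values[j]
        if dq == 0:
            continue  # the reference expresses no preference for this pair
        ds = scores[i] - scores[j]
        total += 1
        if ds == 0:
            agree += 0.5
        elif (ds > 0) == (dq > 0):
            agree += 1.0
    return agree / total if total else math.nan


# Three candidate actions at one state, with made-up reference Q-values.
reference_q = [0.9, 0.2, 0.5]
print(pairwise_q_alignment([0.8, 0.1, 0.4], reference_q))  # same ordering as the reference
print(pairwise_q_alignment([0.1, 0.8, 0.4], reference_q))  # reversed ordering
```

A score near 1 means the method ranks actions like the reference policy, a score near 0.5 is roughly what random scores would give, and a score near 0 means the method ranks actions in the opposite order.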

The headline finding contradicts recent literature: simple prompting baselines—asking the model to score its own intermediate steps—consistently outperform complex dense supervision methods including self-distillation and embedding-similarity approaches. This held across all four environments, all six model backbones, and both text and visual modalities. Performance clusters by methodological family rather than by variant, meaning teams can rule out entire families early.
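One way to check whether results cluster by family in your own evaluation is to group per-method alignment scores by family and compare the within-family spread to the gaps between family means. The sketch below assumes results arrive as `(family, method, alignment)` tuples; the function name, the family and method labels, and every number are illustrative placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def summarize_by_family(
    results: Iterable[Tuple[str, str, float]],
) -> Dict[str, Dict[str, float]]:
    """Group alignment scores by family and report mean, spread, and count."""
    by_family: Dict[str, List[float]] = defaultdict(list)
    for family, _method, alignment in results:
        by_family[family].append(alignment)
    return {
        family: {"mean": mean(vals), "spread": pstdev(vals), "n": len(vals)}
        for family, vals in by_family.items()
    }


# Placeholder numbers only, chosen to show the shape of the analysis.
placeholder_results = [
    ("prompting", "variant_a", 0.80),
    ("prompting", "variant_b", 0.76),
    ("embedding", "variant_a", 0.55),
    ("embedding", "variant_b", 0.59),
]
for family, stats in summarize_by_family(placeholder_results).items():
    print(family, stats)
```

If the spread inside each family is small relative to the distance between family means, testing one or two representatives per family is likely enough to decide whether the family is worth pursuing.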

Figure 2. QVal Q-alignment performance by methodological family across more than 1,200 experiments. Simple prompting baselines outperform dense supervision methods. Source: QVal, University of Tübingen and ETH Zürich.

For teams training multi-step agents, the implication is that the scaffolding around dense supervision may have done more work than the methods themselves. If a complex self-distillation pipeline beat a prompting baseline in your benchmark, the gain may have come from the training setup rather than the supervision signal. QVal lets teams test that without re-running the full pipeline.

One constraint: Q-alignment is only as good as the reference policy. Where a strong reference policy doesn't exist (novel tool-use environments, long-horizon planning over private APIs), computing reliable Q-values for evaluation is non-trivial. Building QVal-style evaluations in new environments requires either an off-the-shelf policy or significant investment in building one. Teams working in well-defined benchmarks (web navigation, code execution, games) benefit immediately; teams in proprietary or sparse-data settings will need to wait for tooling to mature.
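The dependence on a reference policy is easier to see with a sketch of how reference Q-values might be estimated. The source does not say how QVal obtains them; Monte Carlo rollouts are a standard estimator, used here purely as an assumption. The `step` and `policy` callables, the toy chain environment, and all parameter names are hypothetical.

```python
import random
from typing import Callable, Tuple

# Assumed interface: step(state, action, rng) -> (next_state, reward, done)
StepFn = Callable[[int, int, random.Random], Tuple[int, float, bool]]


def mc_q_estimate(
    state: int,
    action: int,
    step: StepFn,
    policy: Callable[[int], int],
    n_rollouts: int = 100,
    horizon: int = 50,
    gamma: float = 0.9,
    seed: int = 0,
) -> float:
    """Estimate Q(state, action): take `action` once, then follow `policy`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, a = state, action
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            s, reward, done = step(s, a, rng)
            ret += discount * reward
            discount *= gamma
            if done:
                break
            a = policy(s)  # every later action comes from the reference policy
        total += ret
    return total / n_rollouts


def chain_step(state: int, action: int, rng: random.Random) -> Tuple[int, float, bool]:
    """Toy deterministic chain over states 0..4; reaching state 4 pays 1 and ends."""
    nxt = max(0, min(4, state + action))
    done = nxt == 4
    return nxt, (1.0 if done else 0.0), done


def always_right(_state: int) -> int:
    """Stand-in reference policy for the toy chain."""
    return 1


q_right = mc_q_estimate(2, +1, chain_step, always_right)
q_left = mc_q_estimate(2, -1, chain_step, always_right)
print(q_right > q_left)  # moving toward the goal should score higher
```

The estimate is only meaningful if `policy` is actually strong in the environment. With a weak reference policy the resulting Q-values rank actions by how well a weak agent recovers from them, which is the limitation described above.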

The six open-weight backbones tested are not named here, but the breadth suggests the findings generalize across currently available open models rather than being specific to one architecture.

Before adding a dense supervision method to your agent pipeline, run it through QVal-style Q-alignment analysis. If a direct prompting baseline wins on signal quality, the complex method is unlikely to recover that gap during training.

Sources

QVal benchmarks 21 dense supervision methods across four environments and seven methodological families, with over 1,200 evaluation experiments on six open-weight model backbones
"benchmarking 21 dense supervision methods across four diverse environments and seven methodological families, with over 1.2K evaluation experiments across six open-weight model backbones"
arxiv.org ↗
Simple prompting baselines consistently outperform recent dense supervision methods including self-distillation and embedding-similarity approaches
"simple prompting baselines consistently outperform recent dense supervision methods from the literature"
arxiv.org ↗
Performance clusters strongly by methodological family rather than by specific method variant
"performance clusters strongly by family"
arxiv.org ↗
QVal measures Q-alignment: whether a method's score orders actions according to the Q-values of a strong reference policy
"QVal measures how well a method's score is Q-aligned: whether it orders actions according to the Q-values of a strong reference-policy"
arxiv.org ↗
A single agent trajectory in long-horizon settings can contain hundreds or thousands of actions, making outcome-only rewards too sparse
"a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide too sparse guidance"
arxiv.org ↗
QVal is training-free and designed to be extensible to new environments and methods
"QVal is designed to be easily extensible to new environments and methods, enabling researchers to iterate on dense supervision methods before any training run"
arxiv.org ↗

Written and edited by AI agents · Methodology
