Large language models' first-answer accuracy on procedural tasks collapses from 61% on 5-step algorithms to 20% on 95-step algorithms, directly contradicting the assumption that strong benchmark scores signal reliable execution in production workflows.
The paper "When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models" by Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, and Mayank Singh evaluates 14 models across 55 datasets. The benchmark is simple: models receive a step-wise arithmetic algorithm and two numeric inputs and must return the final computed value. Complexity scales through algorithm length and look-back dependencies on intermediate variables — the structural challenge that enterprise workflows impose.
The benchmark design exposes a gap that standard reasoning evaluations obscure. Final-answer accuracy metrics dominate leaderboard culture but say nothing about whether a model faithfully executed each step. A model can arrive at a correct answer via shortcut heuristics or lucky cancellations and still fail catastrophically when those shortcuts are unavailable in longer, dependency-heavy traces. Generation-level analysis reveals the failure modes: premature answers, missing answers, self-correction after an initial error, under-executed traces, and hallucinated steps beyond what the algorithm specifies.
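In code, such an analysis amounts to comparing the model's emitted trace against ground-truth step values and bucketing the mismatch. The labels below follow the paper's taxonomy, but the detection heuristics and the function signature are assumptions for illustration:

```python
# Sketch of a generation-level trace classifier. model_steps holds the
# intermediate values the model wrote out; gold_steps holds the values a
# faithful execution would produce. Heuristics here are illustrative.
def classify_trace(model_steps: list[int], gold_steps: list[int],
                   model_answer: int | None, gold_answer: int) -> str:
    if model_answer is None:
        return "missing_answer"
    if len(model_steps) < len(gold_steps):
        # answered before executing every required step: a shortcut that
        # happened to land, or a trace that simply stopped short
        return "premature_answer" if model_answer == gold_answer else "under_executed"
    if len(model_steps) > len(gold_steps):
        return "hallucinated_steps"  # invented work beyond the algorithm
    first_error = next((i for i, (m, g) in enumerate(zip(model_steps, gold_steps))
                        if m != g), None)
    if first_error is not None and model_answer == gold_answer:
        return "self_corrected"      # recovered after an intermediate slip
    return "faithful" if model_answer == gold_answer else "wrong_answer"
```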
For enterprise architects, the implications are direct. Any deployment that uses an LLM to execute a deterministic multi-step process, whether an ETL pipeline, a compliance checklist, a DevOps runbook, or a financial reconciliation workflow, relies on an agent whose reliability degrades as procedure length grows. At 20% first-answer accuracy on 95-step tasks across all 14 tested models, this is not a single-model edge case. It is a class-wide failure mode, and the drop from 61% to 20% unfolds over just 90 additional steps.
Practical exposure differs by use case. Short, bounded workflows of five to ten steps sit in a zone where accuracy remains above 60%. But orchestration layers that compose tool calls, conditional branches, and iterative loops quickly push effective procedure length into ranges where failure probability dominates. RAG pipelines with multi-hop retrieval logic, agentic code-generation loops, and automated incident-response playbooks are all candidates for degraded execution reliability.
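One way to see why length dominates is a back-of-envelope model: if each step succeeded independently with probability p (a deliberate simplification, not a claim the paper makes), whole-procedure reliability would decay as p raised to the step count:

```python
# Compounding illustration under an assumed independent per-step
# success probability; real failure modes are not independent.
def procedure_success(p: float, n_steps: int) -> float:
    return p ** n_steps

for n in (5, 10, 25, 50, 95):
    print(f"{n:>3} steps @ 99% per-step reliability: "
          f"{procedure_success(0.99, n):.0%}")
# 5 steps -> 95%, 50 steps -> 61%, 95 steps -> 38%:
# even near-perfect steps compound into unreliable procedures.
```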
Whether prompting strategies — chain-of-thought, scratchpad enforcement, explicit step labeling — can recover lost accuracy at scale remains open. The authors attribute failure partly to "apparent reasoning ability masking substantial weaknesses in faithful instruction execution," pointing to training data and objective functions rather than prompt engineering as the more durable fix.
For teams evaluating LLM infrastructure, the minimal action is operational: add procedural execution tests that match the step count and dependency structure of your actual workflows before approving any agentic deployment. Benchmark scores on MMLU or GSM8K are the wrong signal for this problem.
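A starting point might look like the sketch below: generate synthetic procedures at the length and look-back depth of your real workflow, then score first-answer accuracy against a deterministic executor. The generator, the bounded-value rule, and `query_model` (a stand-in for whatever inference call your stack uses) are all assumptions for illustration:

```python
# Minimal pre-deployment harness in the spirit of the recommendation
# above. Step format matches the earlier sketch; everything else here
# (generator, modulus, trial count) is an illustrative design choice.
import random
from operator import add, sub, mul

OPS = {"add": add, "sub": sub, "mul": mul}
MOD = 10**6  # bound intermediates so 95-step traces stay tractable;
             # your prompt must state the same reduction rule

def make_procedure(n_steps: int, max_lookback: int, rng: random.Random):
    """Random step list whose operands look back at most max_lookback steps."""
    steps = []
    for i in range(1, n_steps + 1):
        pool = ["x", "y"] + [f"v{j}" for j in range(max(1, i - max_lookback), i)]
        steps.append((f"v{i}", rng.choice(list(OPS)),
                      rng.choice(pool), rng.choice(pool)))
    return steps

def execute(steps, x: int, y: int) -> int:
    """Ground-truth executor with bounded intermediate values."""
    env = {"x": x, "y": y}
    for out, op, a, b in steps:
        env[out] = OPS[op](env[a], env[b]) % MOD
    return env[steps[-1][0]]

def first_answer_accuracy(query_model, n_steps: int, max_lookback: int,
                          trials: int = 50) -> float:
    """Fraction of trials where the model's first answer matches ground truth."""
    rng = random.Random(42)
    hits = 0
    for _ in range(trials):
        steps = make_procedure(n_steps, max_lookback, rng)
        x, y = rng.randint(2, 9), rng.randint(2, 9)
        hits += query_model(steps, x, y) == execute(steps, x, y)
    return hits / trials
```

Sweeping `n_steps` and `max_lookback` to match your actual workflows, rather than testing at a single length, is what reveals where a given model's reliability cliff sits.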