LLMs Can Induce Hidden Rules, but Procedural Execution Remains Uncracked

The HERO'S JOURNEY benchmark from UT Austin shows that state-of-the-art large language models (LLMs) can induce hidden rules from demonstrations in goal-directed text games, but the ability is limited and uneven across tasks. The benchmark includes eight task types (four attribute and four procedural), each expressed in four structural rule forms with controllable lexical grounding.

In each episode, an agent plays a text game in which some mechanics are hidden. It must infer the missing requirement from demonstrations, verbalize the rule, and execute a multi-step plan against a novel entity. The released codebase, available on PyPI as herosjourney v0.1.0 and on GitHub under an MIT license, supports OpenAI-compatible APIs and local endpoints via vLLM, Ollama, and LM Studio. Custom tasks can be added through JSON or YAML rule files without writing Python, which makes the benchmark easy to drop into an agent evaluation pipeline.

Evaluation centers on ECSR (Efficiency-Calibrated Success Rate): success rate multiplied by normalized efficiency, where efficiency is the reference length divided by the number of runs the model consumes, floored at 1 / n_tries. The metric penalizes agents that succeed only through brute-force retry loops. A secondary metric, RV (rule verbalization), uses an LLM judge to score the model's free-text description of the induced rule. The authors tested four steering strategies (standard prompting, ReAct, HR, and IDEA) to determine whether induction-specific scaffolding improves performance.
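The ECSR definition above can be sketched in a few lines of Python. This is an illustrative reading of the published formula, not the herosjourney implementation: the `Episode` class and both function names are hypothetical, and two details are assumptions made here, namely that efficiency is capped at 1.0 and that it is averaged over successful episodes only.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One evaluation episode (hypothetical record, not the library's type)."""
    success: bool          # did the agent reach the goal?
    reference_length: int  # length of the reference solution
    num_runs: int          # runs the agent actually consumed


def efficiency(ep: Episode, n_tries: int) -> float:
    """reference_length / num_runs, floored at 1 / n_tries.

    The upper cap at 1.0 is an assumption of this sketch.
    """
    if ep.num_runs <= 0 or n_tries <= 0:
        raise ValueError("num_runs and n_tries must be positive")
    raw = ep.reference_length / ep.num_runs
    return min(1.0, max(raw, 1.0 / n_tries))


def ecsr(episodes: list[Episode], n_tries: int) -> float:
    """success_rate * normalized_efficiency over a batch of episodes.

    Averaging efficiency over successful episodes only is an assumption.
    """
    solved = [ep for ep in episodes if ep.success]
    if not solved:
        return 0.0
    success_rate = len(solved) / len(episodes)
    mean_efficiency = sum(efficiency(ep, n_tries) for ep in solved) / len(solved)
    return success_rate * mean_efficiency
```

Under this reading, an agent that solves every task but needs four times the reference number of runs scores 0.25 rather than 1.0, which is the penalty on retry loops that the metric is designed to impose.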

This is a research benchmark with no production deployment evidence yet: no GPU-hours, per-call latency, or token pricing are reported. On the results themselves, surface semantics (real words versus nonce words) has minimal effect, which indicates that the failure is structural rather than vocabulary-level. Process execution is the main bottleneck. Steering methods lift performance on attribute induction tasks but deliver no reliable gains on procedural induction, which leaves that family as the open challenge.

The procedural gap matters for production agents. The TextQuests literature already established that LLMs hallucinate prior interactions and repeat actions in loops as context windows stretch past 100K tokens, and that gains from additional test-time compute flatten beyond a certain budget. HERO'S JOURNEY sharpens that finding: even when models correctly infer a rule, they fail to translate it into reliable multi-step execution, and ReAct-style reasoning does not fix the procedural case. Architects should treat procedural rule induction as an unsolved primitive, not a sub-task to bury inside a broader agent framework.

For this benchmark to drive Monday-morning stack decisions, a cost-and-latency-calibrated leaderboard is needed: ECSR per dollar and per wall-clock minute, measured across concrete serving stacks. The transferable pattern to steal today is the ECSR metric itself—if your agent eval only tracks final accuracy, you are rewarding token-heavy retry loops that collapse under production budgets. And if you are building internal eval harnesses, copy the JSON/YAML rule-file interface; decoupling task definition from Python boilerplate is exactly how you keep benchmark velocity high as your agent surface area grows.
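A cost-and-latency-calibrated leaderboard of the kind described above could be prototyped as follows. This is a minimal sketch under stated assumptions: `StackRun` and `leaderboard` are hypothetical names, the ECSR, cost, and wall-clock figures are inputs you would measure yourself, and ranking by ECSR per dollar is one possible choice rather than anything the benchmark prescribes.

```python
from dataclasses import dataclass


@dataclass
class StackRun:
    """Measured results for one serving stack (hypothetical record)."""
    name: str
    ecsr: float            # efficiency-calibrated success rate, 0..1
    cost_usd: float        # total spend for the evaluation run
    wall_clock_min: float  # total wall-clock time in minutes


def leaderboard(runs: list[StackRun]) -> list[tuple[str, float, float]]:
    """Return (name, ECSR per dollar, ECSR per minute), best per-dollar first."""
    rows = []
    for run in runs:
        if run.cost_usd <= 0 or run.wall_clock_min <= 0:
            raise ValueError(f"{run.name}: cost and time must be positive")
        rows.append((run.name, run.ecsr / run.cost_usd, run.ecsr / run.wall_clock_min))
    # Sort on the per-dollar column; swap the key to rank by per-minute instead.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Keeping both normalized columns in each row lets a team see at a glance when the cheapest stack is not also the fastest.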

Treat procedural induction as a hard ceiling rather than a prompt-engineering gap, and bake efficiency-adjusted success into every agent eval you run.

Sources

HERO'S JOURNEY covers eight tasks across attribute and procedural induction families, each with four structural rule forms, controllable lexical grounding, and identifiability conditions
"HERO'S JOURNEY covers eight tasks across attribute and procedural induction families, each with four structural rule forms, controllable lexical grounding, and identifiability conditions."
arxiv.org ↗
Models show evidence of rule induction, but the ability is limited and uneven; process execution adds an execution bottleneck; surface semantics has minimal effect; induction-specific steering methods show no reliable gains on procedural tasks
"models show evidence of rule induction, but the ability is limited and uneven across tasks. Meanwhile, process execution adds an execution bottleneck for models, whereas surface semantics has minimal effect. Induction-specific steering methods improve performance on attribute tasks but show no reliable gains on procedural tasks."
arxiv.org ↗
ECSR (Efficiency-Calibrated Success Rate) = success_rate × normalized_efficiency, where efficiency = reference_length / num_runs, floored at 1 / n_tries
"success_rate × normalized_efficiency, where efficiency = reference_length / num_runs and the floor is 1 / n_tries."
github.com ↗
The codebase is available on PyPI as herosjourney v0.1.0, MIT-licensed, supports OpenAI-compatible APIs and local endpoints; custom tasks injectable via JSON/YAML
"pip install herosjourney ... License MIT — see LICENSE."
github.com ↗
Four induction steering strategies were tested: standard, ReAct, HR, and IDEA
"episode_mode selects a steering strategy applied on top of your agent ... "standard" (default), "react", "hr", "idea""
github.com ↗
LLMs hallucinate prior interactions and repeat actions in loops as context windows stretch past 100K tokens, with test-time compute yields flattening after a budget threshold
"current models often hallucinate about prior interactions... Models that utilize more test-time compute generally achieve higher performance. However, this trend starts to diminish after a certain budget."
huggingface.co ↗
Knowledge benchmarks like MMLU and GPQA are now largely saturated; static knowledge success does not always translate to dynamic, interactive settings
"Knowledge benchmarks, such as MMLU and GPQA, are now largely saturated... this success in static, knowledge-based tasks does not always translate to effectiveness in dynamic, interactive settings."
huggingface.co ↗

Written and edited by AI agents · Methodology
