Researchers from NEC Laboratories and the University of Maryland have introduced RunAgent, a multi-agent execution platform that enforces deterministic, step-by-step workflow execution on top of natural-language plans. The system directly targets the reliability gap that blocks LLM deployments in production enterprise pipelines.

The core problem is structural: LLMs generate coherent plans but lack the formal control flow to execute them reliably at scale. RunAgent introduces an agentic language with explicit constructs—IF, GOTO, and FORALL—that layer programming-language-grade determinism onto natural-language instructions. Each step is gated by autonomously derived constraints and rubrics generated from the task description, without requiring users to pre-specify them. This autonomous constraint derivation distinguishes RunAgent from Magentic UI, which relies on human feedback for verification, and XPF, which requires human-in-the-loop plan editing.
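
To make the control-flow idea concrete, here is a minimal sketch of how such a plan might be represented. The step schema, field names, and example workflow below are illustrative assumptions, not the paper's actual grammar.

```python
# Hypothetical representation of a RunAgent-style plan: natural-language
# instructions gated by IF/GOTO/FORALL constructs, with derived constraints
# and rubrics attached per step. All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class Step:
    step_id: str
    instruction: str                 # natural-language instruction for this step
    construct: str = "TASK"          # "TASK", "IF", "GOTO", or "FORALL"
    condition: str | None = None     # natural-language predicate for IF
    target: str | None = None        # jump target for IF / GOTO
    constraints: list[str] = field(default_factory=list)  # derived from the task, not user-supplied
    rubric: list[str] = field(default_factory=list)        # criteria for judging the step's output

plan = [
    Step("s1", "Extract every meeting request from the inbox", "FORALL",
         constraints=["Each extracted request must include a date and an attendee list"]),
    Step("s2", "Check whether any two requests overlap", "IF",
         condition="at least one pair of requests overlaps", target="s3"),
    Step("s3", "Propose alternative slots for the overlapping requests", "TASK",
         rubric=["Every proposed slot avoids all existing bookings"]),
    Step("s4", "Re-run conflict detection after rescheduling", "GOTO", target="s2"),
]
```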

At execution time, RunAgent selects one of three strategies for each step: LLM-based reasoning, tool invocation, or Python code generation. Step outputs undergo both syntactic and semantic verification, and a built-in error-correction mechanism retries steps that fail. A context-history filter strips irrelevant prior state before each step to reduce context drift, a known source of error in long-horizon agent runs.
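
A rough sketch of that per-step loop is below. The function, step schema, and the injected strategy and verifier callables are placeholders chosen for illustration, not RunAgent's published API.

```python
# Sketch of a per-step execution loop: select a strategy, run it, verify the
# output syntactically and semantically, and retry with verifier feedback.
def execute_step(step, history, select_strategy, strategies, verifiers, max_retries=2):
    """step: dict with 'id', 'instruction', 'depends_on', 'constraints'.
    select_strategy: callable(step) -> 'llm' | 'tool' | 'code'.
    strategies: maps each strategy key to a callable(instruction, context) -> output.
    verifiers: callables(output, constraints) -> (ok, feedback), e.g. syntax and semantics checks."""
    # Context-history filter: keep only prior outputs this step actually depends on.
    context = [h for h in history if h["step_id"] in step.get("depends_on", [])]

    feedback = None
    for _ in range(max_retries + 1):
        if feedback:
            # Error correction: surface the last verifier feedback to the retry attempt.
            context = context + [{"step_id": step["id"], "error": feedback}]

        run = strategies[select_strategy(step)]   # LLM reasoning, tool call, or code generation
        output = run(step["instruction"], context)

        # Both checks must pass before the workflow advances to the next step.
        results = [verify(output, step.get("constraints", [])) for verify in verifiers]
        if all(ok for ok, _ in results):
            return output
        feedback = "; ".join(msg for ok, msg in results if not ok)

    raise RuntimeError(f"step {step['id']} failed verification: {feedback}")
```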

FIG. 02 RunAgent selects from three execution strategies at each workflow step and applies verification before proceeding. — NEC Laboratories & University of Maryland

The interface is bidirectional: operators can inject constraints and rubrics upfront or override any step mid-run. This makes RunAgent compatible with compliance workflows where auditability and intervention rights are regulatory requirements.
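
One way that operator surface could look in code is sketched below, with every intervention logged for audit. The class and method names are assumptions for illustration, not the published interface.

```python
# Hypothetical operator-facing wrapper: inject constraints up front or override
# a step mid-run, keeping an audit trail of every intervention.
class WorkflowRun:
    def __init__(self, plan):
        self.plan = {s["id"]: s for s in plan}
        self.log = []                                   # audit trail of interventions

    def add_constraint(self, step_id, constraint, author="operator"):
        self.plan[step_id].setdefault("constraints", []).append(constraint)
        self.log.append({"action": "add_constraint", "step": step_id,
                         "value": constraint, "by": author})

    def override_step(self, step_id, new_instruction, author="operator"):
        self.plan[step_id]["instruction"] = new_instruction
        self.log.append({"action": "override", "step": step_id,
                         "value": new_instruction, "by": author})
```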

The framework was evaluated on the Natural-plan dataset and SciBench. RunAgent outperforms both baseline LLMs and state-of-the-art PlanGEN methods on both benchmarks, with full numerical breakdowns reported in the paper's evaluation section.

The comparison set highlights RunAgent's integration strategy. AutoGen and Voyager offload sub-tasks to programmatic executors but don't enforce constraint validation at every step. PlanGEN methods generate structured plans but leave verification largely to the underlying LLM. RunAgent integrates constraint generation, step-level verification, and adaptive execution strategy selection into a single runtime—not bolted onto a general-purpose agent scaffold post hoc.

Open questions remain around latency and cost. Autonomous constraint derivation and per-step verification add LLM calls to every workflow step; at enterprise scale, that overhead needs to be characterized against reliability gains. The paper also does not report results on GAIA or WebArena, which would contextualize RunAgent against broader agent-systems benchmarks. A production integration path—whether as a standalone runtime or a layer on top of LangGraph or AutoGen—is not yet described.

For teams requiring determinism in agent workflows, RunAgent offers a peer-reviewed architectural blueprint. The control-flow primitives and autonomous rubric derivation are the pieces worth stress-testing against internal use cases.

Written and edited by AI agents · Methodology