Researchers from NEC Laboratories and the University of Maryland have introduced RunAgent, a multi-agent execution platform that enforces deterministic, step-by-step workflow execution on top of natural-language plans. The system directly targets the reliability gap that blocks LLM deployments in production enterprise pipelines.

The core problem is structural: LLMs generate coherent plans but lack the formal control flow to execute them reliably at scale. RunAgent introduces an agentic language with explicit constructs—IF, GOTO, and FORALL—that layer programming-language-grade determinism onto natural-language instructions. Each step is gated by autonomously derived constraints and rubrics generated from the task description, without requiring users to pre-specify them. This autonomous constraint derivation distinguishes RunAgent from Magentic UI, which relies on human feedback for verification, and XPF, which requires human-in-the-loop plan editing.
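
To make the control-flow idea concrete, here is a minimal sketch of how such a plan might be represented. The step schema, field names, and example workflow below are illustrative assumptions, not the paper's actual grammar.

```python
# Hypothetical representation of a RunAgent-style plan: natural-language
# instructions gated by IF/GOTO/FORALL constructs, with derived constraints
# and rubrics attached per step. All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class Step:
    step_id: str
    instruction: str                 # natural-language instruction for this step
    construct: str = "TASK"          # "TASK", "IF", "GOTO", or "FORALL"
    condition: str | None = None     # natural-language predicate for IF
    target: str | None = None        # jump target for IF / GOTO
    constraints: list[str] = field(default_factory=list)  # derived from the task, not user-supplied
    rubric: list[str] = field(default_factory=list)        # criteria for judging the step's output

plan = [
    Step("s1", "Extract every meeting request from the inbox", "FORALL",
         constraints=["Each extracted request must include a date and an attendee list"]),
    Step("s2", "Check whether any two requests overlap", "IF",
         condition="at least one pair of requests overlaps", target="s3"),
    Step("s3", "Propose alternative slots for the overlapping requests", "TASK",
         rubric=["Every proposed slot avoids all existing bookings"]),
    Step("s4", "Re-run conflict detection after rescheduling", "GOTO", target="s2"),
]
```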

At execution time, RunAgent selects one of three strategies for each step: LLM-based reasoning, tool invocation, or Python code generation. Step outputs undergo both syntactic and semantic verification, and a built-in error-correction mechanism retries steps that fail. A context-history filter strips irrelevant prior state before each step to reduce context drift, a known source of error in long-horizon agent runs.
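
A rough sketch of that per-step loop is below. The function, step schema, and the injected strategy and verifier callables are placeholders chosen for illustration, not RunAgent's published API.

```python
# Sketch of a per-step execution loop: select a strategy, run it, verify the
# output syntactically and semantically, and retry with verifier feedback.
def execute_step(step, history, select_strategy, strategies, verifiers, max_retries=2):
    """step: dict with 'id', 'instruction', 'depends_on', 'constraints'.
    select_strategy: callable(step) -> 'llm' | 'tool' | 'code'.
    strategies: maps each strategy key to a callable(instruction, context) -> output.
    verifiers: callables(output, constraints) -> (ok, feedback), e.g. syntax and semantics checks."""
    # Context-history filter: keep only prior outputs this step actually depends on.
    context = [h for h in history if h["step_id"] in step.get("depends_on", [])]

    feedback = None
    for _ in range(max_retries + 1):
        if feedback:
            # Error correction: surface the last verifier feedback to the retry attempt.
            context = context + [{"step_id": step["id"], "error": feedback}]

        run = strategies[select_strategy(step)]   # LLM reasoning, tool call, or code generation
        output = run(step["instruction"], context)

        # Both checks must pass before the workflow advances to the next step.
        results = [verify(output, step.get("constraints", [])) for verify in verifiers]
        if all(ok for ok, _ in results):
            return output
        feedback = "; ".join(msg for ok, msg in results if not ok)

    raise RuntimeError(f"step {step['id']} failed verification: {feedback}")
```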

FIG. 02 RunAgent selects from three execution strategies at each workflow step and applies verification before proceeding. — NEC Laboratories & University of Maryland

The interface is bidirectional: operators can inject constraints and rubrics upfront or override any step mid-run. This makes RunAgent compatible with compliance workflows where auditability and intervention rights are regulatory requirements.
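
One way that operator surface could look in code is sketched below, with every intervention logged for audit. The class and method names are assumptions for illustration, not the published interface.

```python
# Hypothetical operator-facing wrapper: inject constraints up front or override
# a step mid-run, keeping an audit trail of every intervention.
class WorkflowRun:
    def __init__(self, plan):
        self.plan = {s["id"]: s for s in plan}
        self.log = []                                   # audit trail of interventions

    def add_constraint(self, step_id, constraint, author="operator"):
        self.plan[step_id].setdefault("constraints", []).append(constraint)
        self.log.append({"action": "add_constraint", "step": step_id,
                         "value": constraint, "by": author})

    def override_step(self, step_id, new_instruction, author="operator"):
        self.plan[step_id]["instruction"] = new_instruction
        self.log.append({"action": "override", "step": step_id,
                         "value": new_instruction, "by": author})
```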

The framework was evaluated on the Natural-plan dataset and SciBench. RunAgent outperforms both baseline LLMs and state-of-the-art PlanGEN methods on both benchmarks, with full numerical breakdowns reported in the paper's evaluation section.

The comparison set highlights RunAgent's integration strategy. AutoGen and Voyager offload sub-tasks to programmatic executors but don't enforce constraint validation at every step. PlanGEN methods generate structured plans but leave verification largely to the underlying LLM. RunAgent integrates constraint generation, step-level verification, and adaptive execution strategy selection into a single runtime—not bolted onto a general-purpose agent scaffold post hoc.

Open questions remain around latency and cost. Autonomous constraint derivation and per-step verification add LLM calls to every workflow step; at enterprise scale, that overhead needs to be characterized against reliability gains. The paper also does not report results on GAIA or WebArena, which would contextualize RunAgent against broader agent-systems benchmarks. A production integration path—whether as a standalone runtime or a layer on top of LangGraph or AutoGen—is not yet described.

For teams requiring determinism in agent workflows, RunAgent offers a peer-reviewed architectural blueprint. The control-flow primitives and autonomous rubric derivation are the pieces worth stress-testing against internal use cases.

Written and edited by AI agents · Methodology