A five-researcher team published an agentic architecture that converts plain-language research questions into executable scientific workflow DAGs, eliminating the manual semantic-translation step that forces scientists to hand-code pipeline specifications. Evaluated against the 1000 Genomes population genetics workflow on Kubernetes, the system raised full-match intent accuracy from 44% to 83% across 150 benchmark queries when its knowledge layer was active.

FIG. 02 Skills abstraction nearly doubles full-match intent accuracy — from 44% to 83% across 150 benchmark queries. — Balis et al., arXiv:2604.21910, 2026

The architecture breaks into three discrete layers. The semantic layer uses an LLM to parse natural language into structured intents — the only stage where non-determinism is tolerated. The deterministic layer converts validated intents into reproducible workflow DAGs via constrained generators; the LLM is excluded here, so identical intents always produce identical workflows. The knowledge layer consists of "Skills" — markdown documents authored by domain experts that encode vocabulary mappings, parameter constraints, and optimization strategies. Skills are the primary lever for accuracy: removing them drops full-match intent accuracy to 44%; re-enabling them lifts it to 83%.
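The layer split can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Intent` schema, the stub `parse_intent` (standing in for the Skills-guided LLM call), and the `generate_dag` stage names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    """Structured output of the semantic layer (fields are illustrative)."""
    analysis: str      # e.g. "allele_frequency"
    population: str    # e.g. "EUR"
    chromosome: str    # e.g. "chr21"

def parse_intent(query: str) -> Intent:
    """Semantic-layer stand-in: the only non-deterministic stage.

    In the paper this is an LLM call guided by Skills; here it is a
    trivial keyword mapping so the example runs without a model.
    """
    pop = "EUR" if "European" in query else "ALL"
    return Intent(analysis="allele_frequency", population=pop, chromosome="chr21")

def generate_dag(intent: Intent) -> list[tuple[str, str]]:
    """Deterministic layer: identical intents always yield identical DAGs."""
    stage = f"{intent.analysis}:{intent.population}:{intent.chromosome}"
    return [
        ("fetch_vcf", "filter_population"),
        ("filter_population", stage),
        (stage, "aggregate_results"),
    ]

intent = parse_intent("Allele frequencies for European samples on chr21")
# Non-determinism stops at the Intent boundary: from here on, equality holds.
assert generate_dag(intent) == generate_dag(intent)
```

The design point the sketch captures: once an intent is validated, everything downstream is a pure function of it.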

FIG. 03 Three-layer architecture confines LLM non-determinism to intent parsing; all downstream workflow generation remains deterministic. — Balis et al., arXiv:2604.21910, 2026

Execution runs on the HyperFlow WMS atop Kubernetes. A skill-driven deferred workflow generation strategy, which postpones data-movement decisions until runtime based on Skills-encoded optimization hints, cuts inter-node data transfer by 92% relative to the baseline. End to end, the pipeline handles a query with under 15 seconds of LLM overhead and a per-query LLM cost below $0.001.
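The deferral idea can be sketched as a runtime placement decision instead of a static edge in the DAG. The function name, hint string, and return values below are hypothetical, not the paper's API; the point is that the transfer choice is made once task placement is known.

```python
def plan_transfer(task_node: str, data_node: str, hint: str) -> str:
    """Decide data movement at runtime rather than at DAG-generation time.

    `hint` stands in for a Skills-encoded optimization hint; the action
    strings are illustrative placeholders.
    """
    if task_node == data_node:
        return "local_read"  # data already co-located: no transfer at all
    if hint == "prefer_data_locality":
        # Move the computation to the data instead of shipping the data.
        return f"reschedule_task_to:{data_node}"
    return f"transfer:{data_node}->{task_node}"

# A locality hint turns a would-be transfer into a rescheduling decision.
assert plan_transfer("node-a", "node-b", "prefer_data_locality") == "reschedule_task_to:node-b"
```

Baking such choices into the static DAG would force worst-case data movement; deferring them is what makes the reported 92% transfer reduction plausible.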

Confining LLM non-determinism to intent extraction is the load-bearing design decision for enterprise adoption. Scientific and regulated analytics environments cannot tolerate stochastic pipeline variation downstream of intent; the deterministic generation layer provides an audit boundary. That boundary also makes the system testable: intent-to-DAG translation is fully deterministic, so regression testing is straightforward, a property that ad hoc LLM orchestration frameworks typically lack.
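The regression-testing property that determinism buys can be sketched as a golden-fingerprint check. This is an illustrative pattern, not the paper's tooling: canonicalize the generated DAG, hash it, and pin the hash in CI so any behavioral drift in the generator fails loudly.

```python
import hashlib
import json

def dag_fingerprint(dag: dict) -> str:
    """Hash a canonical serialization of a DAG for golden-file regression tests."""
    canonical = json.dumps(dag, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()

# Two structurally identical DAGs with different key order fingerprint equally,
# so the check tracks semantics, not serialization accidents.
a = {"nodes": ["fetch", "filter"], "edges": [["fetch", "filter"]]}
b = {"edges": [["fetch", "filter"]], "nodes": ["fetch", "filter"]}
assert dag_fingerprint(a) == dag_fingerprint(b)

# In CI (sketch): assert dag_fingerprint(generate_dag(intent)) == GOLDEN[intent_id]
```

With a stochastic generator this test would be meaningless; the hard determinism boundary is what makes the pinned hash a valid oracle.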

The Skills abstraction has direct workflow governance implications. Domain experts encode constraints and vocabulary in markdown — a format with no execution privileges and thus low attack surface — while infrastructure engineers control the deterministic generators separately. This separation of concerns maps onto how large R&D organizations already partition domain knowledge from platform engineering. Updating the system requires no retraining: swapping or extending Skills files changes system behavior without touching model weights or pipeline code.
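To make the hot-swap property concrete, here is a sketch of loading vocabulary mappings from a Skills-style markdown file. The paper specifies markdown but not this exact layout; the `## Vocabulary` heading, the `term -> parameter` line format, and the loader are assumptions for illustration.

```python
# Hypothetical Skills file: plain markdown, no execution privileges.
SKILL_MD = """\
## Vocabulary
- European samples -> population=EUR
- chromosome 21 -> region=chr21

## Constraints
- max_samples: 2504
"""

def load_vocab(md: str) -> dict[str, str]:
    """Extract term-to-parameter mappings from a Vocabulary section."""
    vocab: dict[str, str] = {}
    in_vocab = False
    for line in md.splitlines():
        if line.startswith("## "):
            in_vocab = line.strip() == "## Vocabulary"
        elif in_vocab and line.startswith("- ") and "->" in line:
            term, mapping = (s.strip() for s in line[2:].split("->", 1))
            vocab[term] = mapping
    return vocab

# Swapping or extending SKILL_MD changes system behavior with no retraining,
# no weight updates, and no pipeline-code changes.
print(load_vocab(SKILL_MD))
```

Because the knowledge artifact is inert text, the review process for a Skills change can be an ordinary pull request rather than a model-release cycle.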

Several open questions face teams considering adoption. The evaluation covers a single scientific domain (population genetics) and a single WMS (HyperFlow); generalizability to heterogeneous enterprise data stacks — Spark, dbt, Airflow, proprietary ETL — is unproven. The 83% full-match accuracy implies roughly one in six queries still requires human intervention or fallback handling. The authors do not report partial-match rates or failure modes, which matters for production SLA commitments.

The paper, authored by Bartosz Balis, Michal Orzechowski, Piotr Kica, Michal Dygas, and Michal Kuszewski, was posted to arXiv on 23 April 2026. The architecture pattern is available as a reference design; no production SDK or hosted service is announced. For enterprise research engineering teams evaluating agentic pipeline orchestration, the three-layer decomposition — and the decision to draw a hard non-determinism boundary at intent extraction — is the implementation blueprint worth stress-testing against your own workflow complexity.

Written and edited by AI agents · Methodology