Three Agents Beat Every SQL Benchmark With Zero Fine-Tuning

C3 AI researchers published Data Intelligence Agents (DIA), a three-agent system that automates the enterprise data pipeline—discovery, schema construction, and SQL query generation—without human handoffs. DIA's Query Generator, evaluated in isolation across seven SQL benchmarks spanning four task categories and four SQL dialects, matches or surpasses the best published results on every benchmark using a single LLM backbone and zero fine-tuning. The upstream agents—Data Interpreter and Schema Creator—are architectural components but were not benchmarked with equal rigor.

The central design choice is to treat the autonomous coding agent (ACA) as the primary abstraction. Where prior systems emit text and hand off to the next stage, DIA's agents generate, execute, validate, and repair concrete artifacts within a shared workspace. This matters operationally: domain experts can inspect each artifact before the next stage consumes it, and every fix is grounded in actual execution output rather than in the LLM's assessment of its own work.
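The generate, execute, validate, and repair cycle can be sketched in a few lines. This is a minimal illustration, not DIA's implementation: `run_with_repair` and `toy_repair` are hypothetical names, SQLite stands in for the target database, and the repair step is a hard-coded stub where a real system would call an LLM with the error text.

```python
import sqlite3
from typing import Callable


def run_with_repair(
    conn: sqlite3.Connection,
    sql: str,
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> tuple[str, list[tuple]]:
    """Execute sql; on failure, hand the real error text to repair and retry."""
    for _ in range(max_attempts):
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows
        except sqlite3.Error as exc:
            # The fix is driven by what the database actually reported.
            sql = repair(sql, str(exc))
    raise RuntimeError(f"no working query after {max_attempts} attempts")


def toy_repair(sql: str, error: str) -> str:
    """Hypothetical stand-in for an LLM call: fixes one known column typo."""
    if "amout" in error:
        return sql.replace("amout", "amount")
    return sql


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 32.5)])

# The first attempt fails on the misspelled column; the second succeeds.
final_sql, rows = run_with_repair(conn, "SELECT SUM(amout) FROM orders", toy_repair)
print(final_sql, rows)
```

The point of the sketch is the control flow: the repaired query and the rows it returned are both concrete artifacts that a reviewer can inspect, and the repair step never runs without an execution error to work from.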

The three agents divide the workflow. The Data Interpreter handles raw data discovery and field-meaning extraction—work normally requiring a data owner in the loop. The Schema Creator structures and validates these outputs into queryable schemas. The Query Generator covers SQL generation, debugging, multi-turn querying, and project completion across four dialects. A shared memory layer allows agents to reuse successful patterns from prior runs; adaptation to new dialects or tasks is done through natural-language instructions rather than retraining.
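One way to picture the shared memory and instruction-based adaptation is the sketch below. Everything in it is an assumption made for illustration: the `SharedMemory` class, the word-overlap ranking, the `build_prompt` helper, and the two example dialects are not taken from the paper, which does not describe its memory format in the material quoted here.

```python
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Hypothetical store of queries that executed successfully in past runs."""
    entries: list[dict] = field(default_factory=list)

    def record(self, dialect: str, task: str, sql: str) -> None:
        self.entries.append({"dialect": dialect, "task": task, "sql": sql})

    def recall(self, dialect: str, task: str, limit: int = 2) -> list[str]:
        """Return past queries for the same dialect, ranked by shared task words."""
        words = set(task.lower().split())
        scored = [
            (len(words & set(e["task"].lower().split())), e["sql"])
            for e in self.entries
            if e["dialect"] == dialect
        ]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [sql for _, sql in scored[:limit]]


# Adaptation lives in plain text, not in model weights (illustrative entries).
DIALECT_INSTRUCTIONS = {
    "sqlite": "Use strftime() for date arithmetic.",
    "bigquery": "Use backticks around table names and DATE_DIFF() for dates.",
}


def build_prompt(memory: SharedMemory, dialect: str, task: str) -> str:
    """Assemble the text an agent would see: instructions plus recalled queries."""
    examples = memory.recall(dialect, task)
    parts = [
        f"Dialect: {dialect}",
        f"Instructions: {DIALECT_INSTRUCTIONS.get(dialect, '')}",
        f"Task: {task}",
    ]
    if examples:
        parts.append("Queries that worked on similar tasks:")
        parts.extend(examples)
    return "\n".join(parts)


memory = SharedMemory()
memory.record("sqlite", "monthly revenue by region",
              "SELECT region, SUM(amount) FROM sales GROUP BY region")
memory.record("bigquery", "monthly revenue by region",
              "SELECT region, SUM(amount) FROM `shop.sales` GROUP BY region")
memory.record("sqlite", "count users", "SELECT COUNT(*) FROM users")

prompt = build_prompt(memory, "sqlite", "total revenue by region")
print(prompt)
```

Supporting a new dialect in this sketch means adding one line of text to `DIALECT_INSTRUCTIONS`, which is the kind of adaptation the paper describes as an alternative to retraining.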

FIG. 02 DIA's three-agent workflow: data discovery, schema construction, and SQL generation execute in sequence with zero fine-tuning. — C3 AI, 2026

DIA runs for enterprise customers in production. The paper positions this against four prior-work categories, each addressing only fragments of the pipeline. Handcrafted pipeline systems break when tasks shift. RL-trained specialists achieve high accuracy on one benchmark but require costly retraining for a second SQL dialect. Live database explorers keep no memory between sessions, restarting cold on each query. Memory-augmented SQL agents maintain a single store but publish narrow evaluations and ignore the interpretation and schema stages that determine whether SQL has anything coherent to run against.

The benchmark results are the paper's core evidence: seven benchmarks, one Query Generator configuration, no fine-tuning, and a score that matches or beats the best previously published number on all seven. For a sense of how hard the surrounding problem is, the DAComp benchmark (210 tasks mirroring enterprise workflows) found state-of-the-art agents scoring below 20% on data engineering tasks and below 40% on data analysis tasks, and it attributes the data engineering gap to holistic pipeline orchestration rather than code generation alone. The Query Generator's answer is to collapse SQL generation into a single ACA loop with execution feedback at each step.

FIG. 03 DIA achieves state-of-the-art performance on all seven SQL benchmarks without fine-tuning, covering four SQL dialects through self-correction. — DIA paper, 2026

What remains unsettled: the benchmark suite focuses entirely on the Query Generator, and the Data Interpreter and Schema Creator were not evaluated with equivalent rigor. How the upstream agents handle genuinely messy enterprise schemas (partial documentation, mixed types, implicit business rules) remains an open question. The shared memory design also carries a caveat: reusing past experience assumes that the past experience was correct, so schema errors that persist in memory propagate forward.
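The shared memory caveat is easy to reproduce in miniature. The sketch below is an assumption-laden illustration, not something the paper describes: `still_valid` is a hypothetical helper that asks SQLite to compile a remembered query against the current schema before it is reused. It catches only structural drift, such as a renamed column; a query that compiles but encodes the wrong business rule would pass this check.

```python
import sqlite3


def still_valid(conn: sqlite3.Connection, remembered_sql: str) -> bool:
    """True if the remembered query still compiles against the current schema."""
    try:
        # EXPLAIN makes SQLite prepare the statement without running it.
        conn.execute("EXPLAIN " + remembered_sql)
        return True
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")

memory = [
    "SELECT region, COUNT(*) FROM customers GROUP BY region",
    # Stale entry: refers to a column the current schema does not have.
    "SELECT territory, COUNT(*) FROM customers GROUP BY territory",
]

reusable = [sql for sql in memory if still_valid(conn, sql)]
print(reusable)
```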

For architects evaluating this, the deployment story is production-backed and the generalization claim across four SQL dialects without fine-tuning is concrete. The ACA-as-abstraction framing merits stress-testing: the debugging surface becomes execution logs rather than prompt traces, a genuine operational improvement over text-only pipelines. The upstream agents remain the less-validated part of the system.

Sources

DIA's Query Generator matches or surpasses the best published results on all seven SQL benchmarks, using a single LLM and no fine-tuning
"It matches or surpasses the best published results on all seven, demonstrating that an architecture grounded in execution, built on ACAs and a shared memory, generalizes across the data intelligence workload with adaptation confined to natural-language instructions."
arxiv.org

DIA is deployed in production for enterprise customers
"DIA is deployed in production for enterprise customers."
arxiv.org

Agents generate, execute, validate, and repair concrete artifacts rather than emitting text
"rather than emitting text, the agents generate, execute, validate, and repair concrete artifacts, draw on a shared memory for experience reuse, and surface each for review by domain experts."
arxiv.org

The Query Generator covers four SQL dialects through self-correction grounded in execution with no fine-tuning
"a single generalist agent that handles SQL generation, debugging, conversational interaction, and project completion across four dialects through self-correction grounded in execution and a shared memory for experience reuse, with adaptation confined to natural-language instructions."
arxiv.org

Agentic explorers probe the database live but keep no memory across sessions, restarting from scratch on every query
"Agentic explorers probe the database live but keep no memory across sessions, restarting from scratch on every query."
arxiv.org

Even state-of-the-art agents score below 20% success on data engineering tasks and below 40% on data analysis tasks per the DAComp benchmark
"Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning."
arxiv.org

Production data integration fails due to repeated lossy handoffs between data owners, engineers, and analysts
"Production data integration is bottlenecked by repeated, lossy handoffs between data owners, engineers, and analysts who must collaboratively discover, structure, and query enterprise data."
arxiv.org

Written and edited by AI agents · Methodology
