Every state-of-the-art AI data visualization agent tested on DV-World scores below 50% overall. The benchmark, a 260-task suite accepted at ICML 2026, evaluates agents on real-world professional workflows rather than isolated code sandboxes.

A team of 20 researchers designed three task families to target documented gaps in prior evaluations. DV-Sheet tests native spreadsheet manipulation: agents must create charts and dashboards inside Excel workbooks and diagnose broken visualizations. DV-Evolution tests cross-platform adaptation — given a reference visual artifact and fresh data, an agent must produce a valid updated visualization in a specified target framework, drawn from Python, D3.js, Plotly.js, Vega-Lite, or Apache ECharts. DV-Interact introduces a user simulator that generates ambiguous, underspecified requests, requiring agents to ask clarifying questions and resolve intent before executing.
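
To make the DV-Evolution setup concrete, here is a minimal sketch of what a cross-platform task instance and one valid solution might look like. The field names, file paths, and schema are illustrative assumptions, not the benchmark's published format.

```python
# Hypothetical DV-Evolution task instance; every field name here is an
# illustrative assumption, not the schema shipped with the benchmark.
task = {
    "family": "DV-Evolution",
    "reference_artifact": "charts/quarterly_revenue.png",  # visual to reproduce
    "fresh_data": "data/revenue_2026.csv",                 # new data to plot
    "target_framework": "vega-lite",  # python | d3 | plotly | vega-lite | echarts
}

# A valid answer re-expresses the reference chart over the fresh data in
# the target framework, e.g. as a Vega-Lite spec built as a Python dict.
solution_spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"url": task["fresh_data"]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "quarter", "type": "ordinal"},
        "y": {"field": "revenue", "type": "quantitative"},
    },
}
```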

FIG. 02 The three DV-World task families: spreadsheet workflows, cross-platform visualization, and user interaction scenarios. — DV-World benchmark (arXiv:2604.25914)

The evaluation framework combines two complementary checks. Table-Value Alignment verifies the numerical values in an agent's output against gold-standard results. An MLLM-as-a-Judge component scores semantic and visual quality against structured rubrics, catching errors that string matching alone would miss.
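
A sketch of how these two checks might compose, assuming a gating design in which numeric correctness is a hard prerequisite; the function names, regex-based number extraction, and tolerance are assumptions for illustration, not the paper's implementation.

```python
import re


def table_value_alignment(output_text: str, gold_values: list[float],
                          tol: float = 1e-6) -> bool:
    """Check that every gold-standard value appears in the agent's
    output within a tolerance (a simplified stand-in for the
    benchmark's Table-Value Alignment check)."""
    found = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", output_text)]
    return all(any(abs(f - g) <= tol for f in found) for g in gold_values)


def mllm_judge(chart_image: str, rubric: list[str]) -> float:
    """Stand-in for the MLLM-as-a-Judge step: in the real pipeline a
    multimodal model scores the rendered chart against each rubric
    item. Returns a constant here so the sketch runs end to end."""
    return 1.0  # replace with an actual multimodal-model call


def score_task(output_text: str, chart_image: str,
               gold_values: list[float], rubric: list[str]) -> float:
    """Numeric mismatch zeroes the task; otherwise the judge's
    rubric score is the task score."""
    if not table_value_alignment(output_text, gold_values):
        return 0.0
    return mllm_judge(chart_image, rubric)
```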

For enterprise teams evaluating AI agents for BI pipeline generation, dashboard automation, or analyst co-pilots, the sub-50% ceiling signals caution. Vendor demonstrations typically run in sanitized code environments with clean, single-intent prompts. DV-World's results indicate that performance degrades sharply once tasks involve native Excel formats, multi-framework output requirements, or underspecified stakeholder requests.

The three task families map directly to documented enterprise failure modes. Spreadsheet-native workflows have consistently resisted Python-first agents because they require direct workbook manipulation rather than standalone script generation. Cross-framework chart migration is a routine operational reality as teams change tooling — a scenario no prior benchmark formalized. Ambiguous user intent is the most common cause of incorrect analytical output in production, yet existing evaluations have systematically excluded it by assuming perfect specification.
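
The spreadsheet point is worth making concrete: DV-Sheet-style tasks require creating chart objects inside the workbook itself rather than rendering an image from a script. A minimal sketch using openpyxl, with data and cell layout invented for illustration:

```python
from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference

# Build a chart *inside* the workbook, the kind of native manipulation
# a Python-first, script-generating agent never exercises.
wb = Workbook()
ws = wb.active
ws.append(["Quarter", "Revenue"])
for row in [["Q1", 120], ["Q2", 135], ["Q3", 150], ["Q4", 162]]:
    ws.append(row)

chart = BarChart()
chart.title = "Quarterly Revenue"
data = Reference(ws, min_col=2, min_row=1, max_row=5)  # values plus header
cats = Reference(ws, min_col=1, min_row=2, max_row=5)  # category labels
chart.add_data(data, titles_from_data=True)
chart.set_categories(cats)
ws.add_chart(chart, "D2")  # anchor the chart next to the data

wb.save("revenue.xlsx")
```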

One constraint: DV-Sheet evaluation requires Windows, which complicates fully cloud-native CI pipelines. The benchmark covers 260 tasks in aggregate, but the paper does not publish per-family score breakdowns; identifying whether spreadsheet grounding, cross-framework adaptation, or intent alignment is the primary drag requires running the full suite. The dataset is available on HuggingFace and the evaluation code is published on GitHub.
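
Teams planning to run the suite can fetch the dataset with standard Hugging Face tooling; the repository id below is a placeholder, since the exact id is not quoted here.

```python
from datasets import load_dataset

# "ORG/DV-World" is a placeholder; substitute the repository id
# published alongside the DV-World paper.
dvworld = load_dataset("ORG/DV-World")
print(dvworld)
```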

Written and edited by AI agents · Methodology