Every state-of-the-art AI data visualization agent tested on DV-World scores below 50% overall. The benchmark, a 260-task suite accepted at ICML 2026, evaluates agents on real-world professional workflows rather than isolated code sandboxes.

A team of 20 researchers designed three task families to target documented gaps in prior evaluations. DV-Sheet tests native spreadsheet manipulation: agents must create charts and dashboards inside Excel workbooks and diagnose broken visualizations. DV-Evolution tests cross-platform adaptation — given a reference visual artifact and fresh data, an agent must produce a valid updated visualization in a specified target framework, drawn from Python, D3.js, Plotly.js, Vega-Lite, or Apache ECharts. DV-Interact introduces a user simulator that generates ambiguous, underspecified requests, requiring agents to ask clarifying questions and resolve intent before executing.
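
To make the DV-Evolution setup concrete, here is a minimal sketch of what a cross-platform task instance and one valid solution might look like. The field names, file paths, and schema are illustrative assumptions, not the benchmark's published format.

```python
# Hypothetical DV-Evolution task instance; every field name here is an
# illustrative assumption, not the schema shipped with the benchmark.
task = {
    "family": "DV-Evolution",
    "reference_artifact": "charts/quarterly_revenue.png",  # visual to reproduce
    "fresh_data": "data/revenue_2026.csv",                 # new data to plot
    "target_framework": "vega-lite",  # python | d3 | plotly | vega-lite | echarts
}

# A valid answer re-expresses the reference chart over the fresh data in
# the target framework, e.g. as a Vega-Lite spec built as a Python dict.
solution_spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"url": task["fresh_data"]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "quarter", "type": "ordinal"},
        "y": {"field": "revenue", "type": "quantitative"},
    },
}
```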

FIG. 02 The three DV-World task families: spreadsheet workflows, cross-platform visualization, and user interaction scenarios. — DV-World benchmark (arXiv:2604.25914)

The evaluation framework combines two complementary checks. Table-Value Alignment verifies the numerical values in an agent's output against gold-standard results. An MLLM-as-a-Judge component scores semantic and visual quality against structured rubrics, catching errors that string matching alone would miss.
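
A sketch of how these two checks might compose, assuming a gating design in which numeric correctness is a hard prerequisite; the function names, regex-based number extraction, and tolerance are assumptions for illustration, not the paper's implementation.

```python
import re


def table_value_alignment(output_text: str, gold_values: list[float],
                          tol: float = 1e-6) -> bool:
    """Check that every gold-standard value appears in the agent's
    output within a tolerance (a simplified stand-in for the
    benchmark's Table-Value Alignment check)."""
    found = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", output_text)]
    return all(any(abs(f - g) <= tol for f in found) for g in gold_values)


def mllm_judge(chart_image: str, rubric: list[str]) -> float:
    """Stand-in for the MLLM-as-a-Judge step: in the real pipeline a
    multimodal model scores the rendered chart against each rubric
    item. Returns a constant here so the sketch runs end to end."""
    return 1.0  # replace with an actual multimodal-model call


def score_task(output_text: str, chart_image: str,
               gold_values: list[float], rubric: list[str]) -> float:
    """Numeric mismatch zeroes the task; otherwise the judge's
    rubric score is the task score."""
    if not table_value_alignment(output_text, gold_values):
        return 0.0
    return mllm_judge(chart_image, rubric)
```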

For enterprise teams evaluating AI agents for BI pipeline generation, dashboard automation, or analyst co-pilots, the sub-50% ceiling signals caution. Vendor demonstrations typically run in sanitized code environments with clean, single-intent prompts. DV-World's results indicate that performance degrades sharply once tasks involve native Excel formats, multi-framework output requirements, or underspecified stakeholder requests.

The three task families map directly to documented enterprise failure modes. Spreadsheet-native workflows have consistently resisted Python-first agents because they require direct workbook manipulation rather than standalone script generation. Cross-framework chart migration is a routine operational reality as teams change tooling — a scenario no prior benchmark formalized. Ambiguous user intent is the most common cause of incorrect analytical output in production, yet existing evaluations have systematically excluded it by assuming perfect specification.
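
The spreadsheet point is worth making concrete: DV-Sheet-style tasks require creating chart objects inside the workbook itself rather than rendering an image from a script. A minimal sketch using openpyxl, with data and cell layout invented for illustration:

```python
from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference

# Build a chart *inside* the workbook, the kind of native manipulation
# a Python-first, script-generating agent never exercises.
wb = Workbook()
ws = wb.active
ws.append(["Quarter", "Revenue"])
for row in [["Q1", 120], ["Q2", 135], ["Q3", 150], ["Q4", 162]]:
    ws.append(row)

chart = BarChart()
chart.title = "Quarterly Revenue"
data = Reference(ws, min_col=2, min_row=1, max_row=5)  # values plus header
cats = Reference(ws, min_col=1, min_row=2, max_row=5)  # category labels
chart.add_data(data, titles_from_data=True)
chart.set_categories(cats)
ws.add_chart(chart, "D2")  # anchor the chart next to the data

wb.save("revenue.xlsx")
```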

One constraint: DV-Sheet evaluation requires Windows, which complicates fully cloud-native CI pipelines. The benchmark covers 260 tasks in aggregate, but the paper does not publish per-family score breakdowns; identifying whether spreadsheet grounding, cross-framework adaptation, or intent alignment is the primary drag requires running the full suite. The dataset is available on HuggingFace and the evaluation code is published on GitHub.
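
Teams planning to run the suite can fetch the dataset with standard Hugging Face tooling; the repository id below is a placeholder, since the exact id is not quoted here.

```python
from datasets import load_dataset

# "ORG/DV-World" is a placeholder; substitute the repository id
# published alongside the DV-World paper.
dvworld = load_dataset("ORG/DV-World")
print(dvworld)
```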

Written and edited by AI agents · Methodology