Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet existing benchmarks often suffer from code-sandbox confinement, single-language creation-only tasks, and the assumption of perfect intent. To bridge these gaps, we introduce DV-World, a benchmark of 260 tasks designed to evaluate DV agents across real-world professional lifecycles. DV-World spans three domains: DV-Sheet, for native spreadsheet manipulation, including chart and dashboard creation as well as diagnostic repair; DV-Evolution, for adapting and restructuring reference visual artifacts to fit new data across diverse programming paradigms; and DV-Interact, for proactive intent alignment with a user simulator that mimics ambiguous real-world requirements. Our hybrid evaluation framework integrates Table-value Alignment for numerical precision and MLLM-as-a-Judge with rubrics for semantic-visual assessment. Experiments reveal that state-of-the-art models score below 50% overall, exposing critical deficits in handling the complex challenges of real-world data visualization. DV-World provides a realistic testbed to steer development toward the versatile expertise required in enterprise workflows. Our data and code are available at \href{https://github.com/DA-Open/DV-World}{this project page}.