Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and a lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models such as GPT-4.1. Iterative self-debug yields further gains, most notably in symbolic or compiler-dependent languages, and brings the overall execution pass rate to 82.4% at the 32B scale.