Large language models (LLMs) have recently enabled coding agents that can generate, execute, and revise visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and the lack of iterative correction mechanisms. Progress has also been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale supervised dataset of 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models such as GPT-4.1. Iterative self-debug yields further gains, especially in symbolic or compiler-dependent languages, bringing the overall execution pass rate to 82.4% at the 32B scale.
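To make the multi-round self-debug protocol concrete, the sketch below shows one plausible shape of such an evaluation loop: generate code, execute it, and on failure feed the error back to the model for another attempt. This is a minimal illustration, not the paper's actual harness; `self_debug`, `run_script`, and the `model` callable are hypothetical names, and the real VisPlotBench protocol may differ in prompting, execution sandboxing, and round limits.

```python
import os
import subprocess
import sys
import tempfile


def run_script(code: str, timeout: int = 30):
    """Execute a Python visualization script in a subprocess.

    Returns (ok, stderr) so failures can be fed back to the model.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return proc.returncode == 0, proc.stderr
    finally:
        os.remove(path)


def self_debug(model, task: str, max_rounds: int = 3):
    """Initial generation followed by up to `max_rounds` correction turns.

    `model` is any callable mapping a list of chat messages to a code string,
    standing in for the actual model inference API (an assumption here).
    """
    messages = [{"role": "user", "content": task}]
    code = model(messages)
    for _ in range(max_rounds):
        ok, err = run_script(code)
        if ok:
            return code, True
        # Append the failed attempt and its error as new turns, then retry.
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"The code failed with:\n{err}\nPlease fix it."},
        ]
        code = model(messages)
    return code, run_script(code)[0]
```

Under this framing, the reported execution pass rate is simply the fraction of benchmark tasks for which the loop ends with a successful run, measured either after the initial generation or after the allotted self-debug rounds.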