Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their ability to turn visual figures into executable code has not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLMs. We collect 132 manually selected, high-quality matplotlib plots across six plot types from publicly available matplotlib galleries. For each plot, we provide its source code and a descriptive instruction summarized by GPT-4. This approach enables Plot2Code to extensively evaluate MLLMs' coding capabilities across various input modalities. Furthermore, we propose three automatic evaluation metrics, namely code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images. Instead of simply judging pass or fail, we employ GPT-4V to make an overall judgement between the generated and reference images, which has been shown to be consistent with human evaluation. The evaluation results, which include analyses of 14 MLLMs such as the proprietary GPT-4V and Gemini-Pro and the open-source Mini-Gemini, highlight the substantial challenges presented by Plot2Code. With Plot2Code, we reveal that most existing MLLMs struggle with visual coding for text-dense plots and rely heavily on textual instructions. We hope that the evaluation results from Plot2Code on visual coding will guide the future development of MLLMs. All data involved with Plot2Code are available at https://huggingface.co/datasets/TencentARC/Plot2Code.
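The abstract names the metrics but does not say how they are computed. As a rough illustration only, the first two could be sketched as follows, assuming that code pass rate is the fraction of generated snippets that run without error, and that text-match ratio is the fraction of text elements in the reference plot that also appear in the generated plot. The function names and these exact definitions are assumptions for illustration, not the benchmark's implementation; extracting the text elements from a rendered figure is assumed to happen elsewhere.

```python
import subprocess
import sys
from collections import Counter


def runs_ok(code: str, timeout: float = 30.0) -> bool:
    """Return True if the snippet exits with status 0 in a fresh interpreter."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def code_pass_rate(snippets: list[str]) -> float:
    """Fraction of generated snippets that execute without error (assumed definition)."""
    if not snippets:
        return 0.0
    return sum(runs_ok(s) for s in snippets) / len(snippets)


def text_match_ratio(reference_texts: list[str], generated_texts: list[str]) -> float:
    """Fraction of reference text elements also present in the generated plot.

    Texts are compared as multisets after stripping whitespace, so a label
    that appears twice in the reference must appear twice to count fully.
    This recall-style definition is an assumption, not the paper's formula.
    """
    ref = Counter(t.strip() for t in reference_texts if t.strip())
    gen = Counter(t.strip() for t in generated_texts if t.strip())
    if not ref:
        return 1.0  # nothing to match, so nothing was missed
    matched = sum((ref & gen).values())  # multiset intersection
    return matched / sum(ref.values())
```

Running each snippet in a separate interpreter keeps a crashing or hanging generation from affecting the evaluation loop. The third metric, the GPT-4V overall rating, depends on a model call and is not sketched here.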
