We introduce a new benchmark, ChartMimic, aimed at assessing the visually grounded code generation capabilities of large multimodal models (LMMs). ChartMimic uses information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering. ChartMimic includes 4,800 human-curated (figure, instruction, code) triplets, which represent authentic chart use cases found in scientific papers across various domains (e.g., Physics, Computer Science, and Economics). These charts span 18 regular types and 4 advanced types, further diversified into 201 subcategories. Furthermore, we propose multi-level evaluation metrics to provide an automatic and thorough assessment of both the generated code and the rendered charts. Unlike existing code generation benchmarks, ChartMimic emphasizes evaluating LMMs' capacity to harmonize a blend of cognitive capabilities, encompassing visual understanding, code generation, and cross-modal reasoning. The evaluation of 3 proprietary models and 14 open-weight models highlights the substantial challenges posed by ChartMimic. Even the advanced GPT-4o and InternVL2-Llama3-76B achieve average scores across the Direct Mimic and Customized Mimic tasks of only 82.2 and 61.6, respectively, indicating significant room for improvement. We anticipate that ChartMimic will inspire the development of LMMs, advancing the pursuit of artificial general intelligence.