Programming often involves converting detailed and complex specifications into code, a process during which developers typically use visual aids to convey concepts more effectively. Although recent Large Multimodal Models have demonstrated remarkable abilities in visual reasoning and mathematical tasks, little work has investigated whether these models can effectively interpret visual elements for code generation. To this end, we present MMCode, the first multi-modal coding dataset for evaluating algorithmic problem-solving skills in visually rich contexts. MMCode contains 3,548 questions and 6,620 images collected from real-world programming challenges on 10 code competition websites; the problems are highly challenging because of their extreme demand for reasoning abilities. Our experimental results show that current state-of-the-art models struggle to solve these problems. The results highlight the lack of powerful vision-code models, and we hope MMCode can serve as an inspiration for future work in this domain. The data and code are publicly available at https://github.com/happylkx/MMCode.