Programming often involves translating detailed and complex specifications into code, a process in which developers typically rely on visual aids to convey concepts more effectively. While recent Large Multimodal Models have demonstrated remarkable abilities in visual reasoning and mathematical tasks, little work has investigated whether these models can effectively interpret visual elements for code generation. To this end, we present MMCode, the first multimodal coding dataset for evaluating algorithmic problem-solving skills in visually rich contexts. MMCode contains 3,548 questions and 6,620 images collected from real-world programming challenges on 10 code competition websites, and poses significant challenges due to its extreme demands on reasoning ability. Our experimental results show that current state-of-the-art models struggle to solve these problems. These results highlight the lack of powerful vision-code models, and we hope MMCode can serve as inspiration for future work in this domain. The data and code are publicly available at https://github.com/likaixin2000/MMCode.