Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation

We present Omni-I2C, a comprehensive benchmark for evaluating how well Large Multimodal Models (LMMs) convert complex, structured digital graphics into executable code. We argue that this task is a non-trivial challenge for the current generation of LMMs: it demands close synergy between high-fidelity visual perception, needed to parse intricate spatial hierarchies and symbolic details, and precise generative expression, needed to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires holistic understanding, because any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction. Omni-I2C comprises 1080 meticulously curated samples and is defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a wide spectrum of digital content, from scientific visualizations to complex symbolic notations, with each sample paired with executable reference code. To complement this breadth, our evaluation framework provides the necessary depth: by decoupling performance into perceptual fidelity and symbolic precision, it goes beyond surface-level accuracy to expose the fine-grained structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge. Data and code are available at https://github.com/MiliLab/Omni-I2C.
