Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

Xuanle Zhao, Qiushi Sun, Jingyu Xiao, Xuexin Liu, Haoyue Yang, Qiaosheng Chen, Xianzhen Luo, Jing Huang, Yufeng Zhong, Lei Chen, Shuai Fu, Zhenlin Wei, Jinhe Bi, Lei Jiang, Haibo Qiu, Siqi Yang, Peng Shi, Jian Hu, Zhixiong Zeng

From arXiv. Work completed in January 2026; updates are ongoing.

While LLMs have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, documents, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, geometry, data semantics, editability, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, execute, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move multimodal code generation from single-output imitation toward evidence-grounded executable systems.
