Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across the various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multi-hop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focus. Specifically, ReFocus enables multimodal LLMs to generate Python code that calls tools to modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus substantially improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and of why ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and show that such a visual chain-of-thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.
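To make the editing operations concrete, the sketch below shows the kind of visual-editing tools that generated Python code could call. This is a minimal illustration using Pillow; the function names, signatures, and coordinates are hypothetical assumptions, not the paper's actual tool API.

```python
# Hypothetical ReFocus-style editing tools (illustrative only): draw a box
# to direct attention, highlight a region, or mask out distracting areas.
from PIL import Image, ImageDraw

def draw_box(img, box, color="red", width=3):
    """Draw a rectangle around a region to direct attention to it."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, outline=color, width=width)
    return out

def highlight(img, box, color=(255, 255, 0, 80)):
    """Overlay a translucent highlight on a region."""
    out = img.convert("RGBA")
    overlay = Image.new("RGBA", out.size, (0, 0, 0, 0))
    ImageDraw.Draw(overlay).rectangle(box, fill=color)
    return Image.alpha_composite(out, overlay)

def mask_out(img, box, fill="white"):
    """Cover an irrelevant region so it no longer distracts the model."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, fill=fill)
    return out

# Example edit sequence: box one table column, then mask out the rest.
img = Image.new("RGB", (400, 300), "lightgray")  # stand-in for a table image
focused = mask_out(draw_box(img, (50, 20, 150, 280)), (160, 0, 400, 300))
```

A model following this pattern would emit such calls step by step, re-reading the edited image after each operation, so each intermediate image serves as a "visual thought" in the reasoning chain.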