As a multimodal extension of Chain-of-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising way to enhance the reasoning capability of Multimodal Large Language Models (MLLMs). TWI generates an interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods relies heavily on the assumption that interleaved image-text CoTs are faultless, an assumption that is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to imperfections in the visual-cue mining and answer reasoning processes. As the saying goes, ``one mistake leads to another'': an erroneous interleaved CoT causes errors to accumulate, significantly degrading the performance of MLLMs. To address the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT.
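To make the filter-then-vote idea concrete, the following is a minimal sketch and not the RTWI implementation, whose details the abstract does not give. It assumes each sampled reasoning chain has already been reduced to a final answer plus a scalar reliability score in [0, 1]. The function name `reliable_vote`, the `threshold` parameter, and the score values are all hypothetical.

```python
from collections import defaultdict


def reliable_vote(candidates, threshold=0.5):
    """Pick an answer from (answer, reliability) pairs.

    Hypothetical sketch: how reliability is estimated is NOT specified
    here; scores are assumed to be given and to lie in [0, 1].
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    # Filtering: drop chains whose estimated reliability is below the threshold.
    kept = [(ans, rel) for ans, rel in candidates if rel >= threshold]

    # If the filter removes everything, fall back to all candidates.
    if not kept:
        kept = list(candidates)

    # Voting: each surviving chain votes with a weight equal to its reliability.
    scores = defaultdict(float)
    for ans, rel in kept:
        scores[ans] += rel

    return max(scores, key=scores.get)


# Three low-reliability chains agree on "B"; two high-reliability chains say "A".
chains = [("B", 0.30), ("B", 0.35), ("B", 0.40), ("A", 0.90), ("A", 0.80)]
print(reliable_vote(chains))
```

In this toy input a plain majority vote would select "B", while the reliability-aware version discards the three low-scoring chains and selects "A", illustrating how noisy chains can be kept from contaminating the final answer.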