As a multimodal extension of Chain-of-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising avenue for enhancing the reasoning capability of Multimodal Large Language Models (MLLMs): it generates interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods relies heavily on the assumption that interleaved image-text CoTs are faultless, an assumption that is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to imperfect visual-cue mining and answer reasoning. As the saying goes, ``One mistake leads to another'': an erroneous interleaved CoT causes errors to accumulate, significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT.
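The filter-then-vote idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how reliability is estimated or how the modules are built, so the function name `reliable_vote`, the `threshold` parameter, the per-chain reliability scores in [0, 1], and the fallback rule are all assumptions made purely for illustration.

```python
from collections import defaultdict


def reliable_vote(chains, threshold=0.5):
    """Pick an answer from (answer, reliability) pairs.

    Hypothetical helper: `chains` is a non-empty list of pairs, where
    each reliability score is assumed to lie in [0, 1] and to have been
    estimated elsewhere.
    """
    # Filtering: drop chains whose estimated reliability is below the threshold.
    kept = [(answer, score) for answer, score in chains if score >= threshold]
    # Assumed fallback: if every chain was filtered out, vote over all of them.
    if not kept:
        kept = list(chains)
    # Voting: sum the reliability scores per answer and return the top answer.
    totals = defaultdict(float)
    for answer, score in kept:
        totals[answer] += score
    return max(totals, key=totals.get)


if __name__ == "__main__":
    sampled = [("A", 0.9), ("B", 0.4), ("B", 0.4), ("B", 0.4)]
    print(reliable_vote(sampled))
```

In this toy input, answer "B" appears three times but only in low-reliability chains, so the filter removes it before voting and "A" is returned; with `threshold=0.0` the summed scores would favor "B" instead.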