Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning, which offers the advantages of multimodal representation, reduced annotation effort, and inference efficiency. To facilitate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information into the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy that enables MLLMs to perform these multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average accuracy improvement of 5.45%, while simultaneously delivering a speedup of over 5x compared to existing approaches. Code is available at https://github.com/FYYDCC/IVT-LR.
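The abstract describes each reasoning step as a combination of latent text (hidden states from the previous step) and latent vision (a set of selected image embeddings). The following is a minimal, purely illustrative sketch of what one such step might look like; the function name, the similarity-based selection criterion, and the interleaving by concatenation are assumptions for illustration, not the paper's exact mechanism.

```python
# Hypothetical sketch of one interleaved latent reasoning step.
# All names and the selection rule are illustrative assumptions.
import numpy as np

def latent_reasoning_step(prev_hidden, image_embeddings, k=2):
    """Build the latent input for the next reasoning step.

    prev_hidden:      (d,) hidden state from the previous step (latent text).
    image_embeddings: (n, d) candidate image embeddings.
    k:                number of image embeddings to select (latent vision).
    """
    # Score each image embedding by similarity to the current hidden state
    # (an assumed selection criterion, stated here only for illustration).
    scores = image_embeddings @ prev_hidden
    top_k = np.argsort(scores)[-k:]
    latent_vision = image_embeddings[top_k]  # (k, d)
    # Interleave the latent-text and latent-vision parts into one step.
    return np.vstack([prev_hidden[None, :], latent_vision])  # (1 + k, d)

rng = np.random.default_rng(0)
step = latent_reasoning_step(rng.normal(size=8), rng.normal(size=(16, 8)), k=2)
print(step.shape)  # (3, 8)
```

Because the step operates entirely on embeddings rather than decoded text tokens, it avoids both explicit vision-text annotation and autoregressive generation of intermediate rationales, which is consistent with the efficiency claims in the abstract.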