Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning, which offers the advantages of multimodal representation, reduced annotation cost, and improved inference efficiency. To realize this paradigm, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information into the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy that enables MLLMs to perform these multimodal latent reasoning steps. Experiments on M$^3$CoT and ScienceQA demonstrate that IVT-LR achieves an average accuracy improvement of 5.45\%, while delivering a speed-up of over 5$\times$ compared to existing approaches.
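The iterative structure described above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: the similarity-based selection of image embeddings, the mean-pooling fusion, and the fixed number of steps are all assumptions introduced here to show how latent text (the previous step's hidden state) and latent vision (a selected subset of image embeddings) might be interleaved across reasoning steps.

```python
import numpy as np

def latent_reasoning_step(hidden_state, image_embeddings, k=2):
    """One hypothetical latent reasoning step: pair the previous step's
    hidden state (latent text) with the k image embeddings most similar
    to it (latent vision), then fuse them into the next latent state.
    The dot-product selection and averaging fusion are illustrative
    assumptions, not the actual IVT-LR mechanism."""
    # Score each candidate image embedding against the current hidden state.
    scores = image_embeddings @ hidden_state
    top_k = np.argsort(scores)[-k:]            # indices of the k best-matching embeddings
    latent_vision = image_embeddings[top_k]    # selected image embeddings
    # Fuse latent text and latent vision into the next hidden state.
    return (hidden_state + latent_vision.mean(axis=0)) / 2.0

rng = np.random.default_rng(0)
h = rng.normal(size=8)              # latent text carried over from the previous step
patches = rng.normal(size=(16, 8))  # candidate image embeddings for this example

for _ in range(3):                  # a few interleaved latent reasoning steps
    h = latent_reasoning_step(h, patches)
print(h.shape)                      # the latent state keeps a fixed dimensionality
```

The key property the sketch captures is that no reasoning text is ever decoded between steps: the state passed forward lives entirely in the latent space, which is what removes the per-step generation cost of explicit chain-of-thought.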