Chain-of-thought (CoT) reasoning has enabled language models to perform impressively on complex tasks and question answering. However, many real-world questions require multi-modal information, such as text and images. Previous research on multi-modal CoT has primarily extracted fixed image features from off-the-shelf vision models and then fused them with text using attention mechanisms. This approach is limited because these vision models were not designed for complex reasoning tasks, and their features do not align well with language thoughts. To overcome this limitation, we introduce a novel approach for multi-modal CoT reasoning that uses latent space learning via diffusion processes to generate effective image features that align with language thoughts. Our method fuses image features and text representations at a deep level and improves the complex reasoning ability of multi-modal CoT. We demonstrate the efficacy of the proposed method on the multi-modal ScienceQA benchmark and on machine translation benchmarks, achieving state-of-the-art performance on ScienceQA. Overall, our approach offers a more robust and effective solution for multi-modal reasoning in language models, enhancing their ability to tackle complex real-world problems.