Multimodal reasoning is a challenging task that requires models to reason across multiple modalities to answer questions. Existing approaches have made progress by incorporating language and visual modalities into a two-stage reasoning framework, separating rationale generation from answer inference. However, these approaches often fall short due to the inadequate quality of the generated rationales. In this work, we delve into the importance of rationales in model reasoning. We observe that when rationales are completely accurate, the model's accuracy significantly improves, highlighting the need for high-quality rationale generation. Motivated by this, we propose MC-CoT, a self-consistency training strategy that generates multiple rationales and answers, subsequently selecting the most accurate through a voting process. This approach not only enhances the quality of generated rationales but also leads to more accurate and robust answers. Through extensive experiments, we demonstrate that our approach significantly improves model performance across various benchmarks. Remarkably, we show that even smaller base models, when equipped with our proposed approach, can achieve results comparable to those of larger models, illustrating the potential of our approach in harnessing the power of rationales for improved multimodal reasoning. The code is available at https://github.com/chengtan9907/mc-cot.
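The voting idea in the abstract (generate several rationale and answer pairs, then pick by vote) can be illustrated with a minimal sketch. This is not the MC-CoT implementation: the abstract does not say how votes are counted or how voting enters training, so the sketch assumes plain majority voting over the final answers. The function name `vote_answer` and the example candidates are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def vote_answer(candidates: List[Tuple[str, str]]) -> Tuple[str, List[str]]:
    """Pick the majority answer from (rationale, answer) pairs.

    Assumption: simple majority voting over answers. Ties go to the
    answer that appeared first among the candidates.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    # Count how many sampled candidates ended at each answer.
    counts = Counter(answer for _, answer in candidates)
    winner, _ = counts.most_common(1)[0]
    # Keep the rationales that led to the winning answer.
    supporting = [rationale for rationale, answer in candidates if answer == winner]
    return winner, supporting


# Hypothetical candidates, standing in for several sampled model outputs.
candidates = [
    ("Opposite poles face each other, so the magnets attract.", "A"),
    ("Like poles face each other, so the magnets repel.", "B"),
    ("North faces south, which means attraction.", "A"),
]
winner, supporting = vote_answer(candidates)
print(winner, len(supporting))
```

Returning the supporting rationales together with the winning answer keeps each selected answer paired with the reasoning that produced it.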