The advancement of Large Language Models (LLMs) has brought substantial attention to the Chain of Thought (CoT) approach, primarily due to its ability to enhance the capability of LLMs on tasks requiring complex reasoning. Moreover, the significance of CoT approaches extends to the application of LLMs to multi-modal tasks, such as multi-modal question answering. However, the selection of optimal CoT demonstration examples for multi-modal reasoning remains less explored, owing to the inherent complexity of multi-modal examples. In this paper, we introduce a novel approach that addresses this challenge by using retrieval mechanisms to dynamically and automatically select demonstration examples based on cross-modal similarities. This method aims to refine the CoT reasoning process in multi-modal scenarios by providing LLMs with more relevant and informative examples. Furthermore, we employ a stratified sampling method that categorises demonstration examples into groups based on their types and retrieves examples from each group separately, to promote the diversity of demonstration examples. Through a series of experiments, we demonstrate that our approach significantly improves the performance of LLMs, achieving state-of-the-art results in multi-modal reasoning tasks. Specifically, our methods demonstrate significant advancements on the ScienceQA dataset. Our ChatGPT-based method outperforms Chameleon (ChatGPT) by 2.74% with an accuracy of 82.67%, and our GPT-4-based method surpasses Chameleon (GPT-4) by 0.89%, achieving an accuracy of 87.43% under the same setting. Moreover, our best-performing methods show a 6.05% increase over Chameleon for ChatGPT-based models and a 4.57% increase for GPT-4-based models.
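To make the idea of retrieval with stratified sampling concrete, the following is a minimal sketch, not the implementation evaluated in this paper. It assumes that each candidate demonstration already has a precomputed embedding and a group label. The group names (`"text"`, `"image"`), the toy two-dimensional vectors, and the function name `retrieve_stratified` are all hypothetical, and plain cosine similarity stands in for whatever cross-modal similarity measure is actually used.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_stratified(query_vec, pool, k_per_group=1):
    """Pick the k most similar demonstrations from each group.

    `pool` is a list of dicts with hypothetical keys 'id', 'group', and 'vec'.
    Retrieving from every group separately keeps the selected set diverse.
    """
    groups = defaultdict(list)
    for example in pool:
        groups[example["group"]].append(example)

    selected = []
    for name in sorted(groups):  # fixed group order for reproducibility
        ranked = sorted(
            groups[name],
            key=lambda ex: cosine(query_vec, ex["vec"]),
            reverse=True,
        )
        selected.extend(ranked[:k_per_group])
    return selected


# Toy demonstration pool; the vectors are illustrative, not real embeddings.
pool = [
    {"id": "t1", "group": "text", "vec": [1.0, 0.0]},
    {"id": "t2", "group": "text", "vec": [0.0, 1.0]},
    {"id": "i1", "group": "image", "vec": [0.9, 0.1]},
    {"id": "i2", "group": "image", "vec": [0.1, 0.9]},
]

demos = retrieve_stratified([1.0, 0.2], pool, k_per_group=1)
print([d["id"] for d in demos])  # ['i1', 't1']
```

In a full pipeline, the selected demonstrations would be placed in the prompt ahead of the test question. How they are ordered and formatted is a separate design choice that this sketch does not cover.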