Large Language Models (LLMs) have demonstrated impressive performance on natural language processing tasks by leveraging chain-of-thought (CoT) reasoning, which enables step-by-step thinking. Extending LLMs with multimodal capabilities has attracted recent interest, but it incurs high computational cost and requires substantial hardware resources. To address these challenges, we propose KAM-CoT, a framework that integrates CoT reasoning, Knowledge Graphs (KGs), and multiple modalities for a comprehensive understanding of multimodal tasks. KAM-CoT adopts a two-stage training process with KG grounding to generate effective rationales and answers. By incorporating external knowledge from KGs during reasoning, the model gains a deeper contextual understanding, which reduces hallucinations and improves answer quality. This knowledge-augmented CoT reasoning enables the model to handle questions that require external context and to provide more informed answers. Experimental results show that KAM-CoT outperforms state-of-the-art methods. On the ScienceQA dataset, we achieve an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) and GPT-4 (83.99%) by roughly 18 and 10 percentage points, respectively. Remarkably, KAM-CoT achieves these results with only 280M trainable parameters at a time, demonstrating its cost-efficiency and effectiveness.
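The reported margins can be checked with a few lines of arithmetic. The sketch below uses only the three accuracy figures quoted above; the helper names `margin_pp` and `margin_relative` are illustrative assumptions and are not part of KAM-CoT. It separates the absolute gap (in percentage points) from the relative improvement, since the two are easy to confuse when a margin is written as a bare percentage.

```python
# Average ScienceQA accuracies (%) as quoted in the abstract.
reported = {"KAM-CoT": 93.87, "GPT-3.5": 75.17, "GPT-4": 83.99}


def margin_pp(ours: float, baseline: float) -> float:
    """Absolute gap between two accuracies, in percentage points."""
    return round(ours - baseline, 2)


def margin_relative(ours: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent of the baseline."""
    return round(100.0 * (ours - baseline) / baseline, 2)


if __name__ == "__main__":
    ours = reported["KAM-CoT"]
    for name in ("GPT-3.5", "GPT-4"):
        base = reported[name]
        print(
            f"vs {name}: +{margin_pp(ours, base)} pp absolute, "
            f"+{margin_relative(ours, base)}% relative"
        )
```

Under these figures the absolute gaps are 18.70 and 9.88 percentage points, which matches the rounded margins stated in the abstract; the relative improvements are larger because they are measured against the lower baseline accuracies.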