With the rise of large language models (LLMs), converting multimodal information into text descriptions has become a popular and effective strategy for multimodal multi-hop question answering. However, we argue that current methods still face two main challenges: 1) the retrieved evidence contains a large amount of redundant information, and this irrelevant content misleads prediction, inevitably causing a significant drop in performance; 2) without interpretable reasoning steps, it is difficult for the model to discover its logical errors when handling complex questions. To address these problems, we propose a unified LLM-based approach that nevertheless avoids relying heavily on LLMs, given their potential errors, and innovatively treat multimodal multi-hop question answering as a joint problem of entailment tree generation and question answering. Specifically, we design a multi-task learning framework that facilitates knowledge sharing between the interpretability and prediction tasks while using a mixture-of-experts mechanism to prevent task-specific errors from interfering with each other. We then design an iterative feedback mechanism that further enhances both tasks: the results of joint training are fed back to the LLM to regenerate entailment trees, iteratively refining the potential answer. Notably, our method ranks first on the official WebQA leaderboard (since April 10, 2024) and achieves competitive results on MultimodalQA.
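To make the iterative feedback mechanism concrete, the following is a minimal Python sketch of one plausible control loop: the LLM (re)generates an entailment tree over the retrieved evidence, the jointly trained multi-task model scores the tree (interpretability task) and predicts an answer (QA task), and both results are fed back to the LLM for the next round. All identifiers here (`generate_entailment_tree`, `joint_model`, the feedback dictionary) are hypothetical placeholders, not the paper's actual API.

```python
from typing import Any, Callable, Optional, Tuple

def iterative_refine(
    question: str,
    evidence: list,
    llm: Any,  # assumed to expose generate_entailment_tree(q, ev, feedback)
    joint_model: Callable[[str, list, Any], Tuple[float, str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], Any]:
    """Alternate between LLM tree generation and joint multi-task QA."""
    feedback: Optional[dict] = None
    answer: Optional[str] = None
    tree = None
    for _ in range(max_rounds):
        # 1) LLM (re)generates an entailment tree, conditioned on the
        #    feedback produced in the previous round (None on round one).
        tree = llm.generate_entailment_tree(question, evidence, feedback)

        # 2) The multi-task model jointly scores the tree and predicts an
        #    answer; internally, a mixture-of-experts layer is assumed to
        #    route shared vs. task-specific features.
        tree_score, new_answer = joint_model(question, evidence, tree)

        if new_answer == answer:
            break  # prediction has stabilized; stop early
        answer = new_answer

        # 3) Feed the joint results back to the LLM for the next round.
        feedback = {"tree": tree, "tree_score": tree_score, "answer": answer}
    return answer, tree
```

The early stop on a stable answer reflects the stated goal of iteratively refining the potential answer, rather than running a fixed number of LLM calls.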