While recent large language models (LLMs) have improved on various question answering (QA) datasets, it remains difficult for a single model to generalize across question types that require distinct reasoning abilities. We provide empirical evidence that state-of-the-art LLMs generalize poorly to reasoning types beyond those seen in the prompt. To remedy this, we propose a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse specialized language models. We specialize the backbone language model with prompts optimized for different reasoning categories, including factual, multihop, mathematical, and commonsense reasoning. Our key insight is to leverage agreement among the specialized experts to select the best answer for each question, or to abstain from answering. This gives MoRE higher accuracy than any single specialized model on a collection of 12 QA datasets from four reasoning types. Beyond generalizability, the interpretable design of MoRE improves selective question answering over baselines that do not incorporate inter-expert agreement. The framework is also more interpretable and more useful to human consumers of QA outputs: our human study confirms that presenting the expert predictions and the answer selection process helps annotators more accurately calibrate when to trust the system's output. We release all code and data to facilitate future work.
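To make the idea of agreement-based answer selection concrete, the following is a minimal sketch rather than the selector used in MoRE itself. It assumes each specialized expert has already produced a string answer, and it applies a simple vote-count rule: return the most common answer if enough experts agree, otherwise abstain. The function names `normalize` and `select_answer`, the `min_agreement` threshold, and the string normalization are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, Optional


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def select_answer(expert_answers: Dict[str, str], min_agreement: int = 2) -> Optional[str]:
    """Return the most agreed-upon normalized answer, or None to abstain.

    expert_answers maps an expert name (e.g. "factual") to its predicted answer.
    min_agreement is an assumed threshold, not a value taken from the paper.
    Ties are broken in favor of the answer that was seen first.
    """
    if not expert_answers:
        return None
    counts = Counter(normalize(a) for a in expert_answers.values())
    answer, votes = counts.most_common(1)[0]
    return answer if votes >= min_agreement else None


# Hypothetical predictions from the four expert types named in the abstract.
predictions = {
    "factual": "Paris",
    "multihop": "paris",
    "math": "42",
    "commonsense": "Paris ",
}
selected = select_answer(predictions)
```

Here three of the four experts agree after normalization, so an answer is returned; if every expert gave a different answer, the function would return `None`, which corresponds to abstaining. A learned selector could replace the fixed threshold while keeping inter-expert agreement as an input signal.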