State-of-the-art Large Language Models (LLMs) are credited with an increasing number of capabilities, ranging from reading comprehension and advanced mathematical and reasoning skills to possessing scientific knowledge. In this paper we focus on their multi-hop reasoning capability: the ability to identify and integrate information from multiple textual sources. Given concerns about the presence of simplifying cues in existing multi-hop reasoning benchmarks, which allow models to circumvent the reasoning requirement, we set out to investigate whether LLMs are prone to exploiting such simplifying cues. We find evidence that they indeed circumvent the requirement to perform multi-hop reasoning, but in more subtle ways than was reported for their fine-tuned pre-trained language model (PLM) predecessors. Motivated by this finding, we propose a challenging multi-hop reasoning benchmark by generating seemingly plausible multi-hop reasoning chains that ultimately lead to incorrect answers. We evaluate multiple open and proprietary state-of-the-art LLMs and find that their multi-hop reasoning performance is affected, as indicated by a relative decrease of up to 45% in F1 score when they are presented with such seemingly plausible alternatives. A deeper analysis provides evidence that while LLMs tend to ignore misleading lexical cues, misleading reasoning paths indeed present a significant challenge.