Large Language Models (LLMs), acting as powerful reasoners and generators, exhibit extraordinary performance across various natural language tasks, such as question answering (QA). Among these tasks, Multi-Hop Question Answering (MHQA) is a widely discussed category that requires seamless integration between LLMs and the retrieval of external knowledge. Existing methods employ an LLM to generate reasoning paths and plans, and use an Information Retriever (IR) to iteratively retrieve related knowledge, but these approaches have inherent flaws. On the one hand, the IR is hindered by the low quality of the queries generated by the LLM. On the other hand, the LLM is easily misled by irrelevant knowledge returned by the IR. These inaccuracies accumulate over the iterative interaction between the IR and the LLM and severely degrade final effectiveness. To overcome these barriers, we propose a novel pipeline for MHQA called Furthest-Reasoning-with-Plan-Assessment (FuRePA), which comprises an improved framework (Furthest Reasoning) and an attached module (Plan Assessor). 1) Furthest Reasoning masks the previous reasoning path and generated queries from the LLM, encouraging it to generate a chain of thought from scratch in each iteration. This enables the LLM to break free of the constraints imposed by previous misleading thoughts and queries (if any). 2) The Plan Assessor is a trained evaluator that selects an appropriate plan from a group of candidate plans proposed by the LLM. Our methods are evaluated on three highly recognized public multi-hop question answering datasets and outperform the state of the art on most metrics (achieving a 10%-12% improvement in answer accuracy).
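The two components described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `furthest_reasoning`, the callables `propose_plans`, `score_plan`, and `retrieve`, the `ANSWER:` prefix, and the `max_iters` limit are all hypothetical. The sketch assumes that a plan is either a follow-up retrieval query or a final answer, and that the Plan Assessor can be modeled as a scoring function over candidate plans.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-ins, not the authors' API:
#   ProposePlans: the LLM, given the question and evidence, proposes candidate plans.
#   ScorePlan:    the trained Plan Assessor, modeled here as a scoring function.
#   Retrieve:     the information retriever, mapping a query to passages.
ProposePlans = Callable[[str, Sequence[str]], List[str]]
ScorePlan = Callable[[str, Sequence[str], str], float]
Retrieve = Callable[[str], List[str]]

ANSWER_PREFIX = "ANSWER:"  # assumed marker for a plan that is a final answer


def furthest_reasoning(
    question: str,
    propose_plans: ProposePlans,
    score_plan: ScorePlan,
    retrieve: Retrieve,
    max_iters: int = 4,
) -> Optional[str]:
    """Iterate LLM -> assessor -> retriever without carrying over past reasoning."""
    evidence: List[str] = []
    for _ in range(max_iters):
        # Masking: only the question and the retrieved evidence are passed on.
        # Earlier chains of thought and earlier queries are never shown again,
        # so each iteration reasons from scratch.
        candidates = propose_plans(question, evidence)
        if not candidates:
            break
        # Plan assessment: keep the highest-scoring candidate plan.
        best = max(candidates, key=lambda plan: score_plan(question, evidence, plan))
        if best.startswith(ANSWER_PREFIX):
            return best[len(ANSWER_PREFIX):].strip()
        # Otherwise treat the selected plan as a retrieval query.
        for passage in retrieve(best):
            if passage not in evidence:
                evidence.append(passage)
    return None  # no answer within the iteration budget
```

In this sketch the only state carried between iterations is the evidence list, which is one simple way to realize the masking idea; how the actual system formats prompts or trains the assessor is not specified here.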