We address the task of evidence retrieval for long document question answering, which involves locating the paragraphs within a document that are relevant to answering a question. Given their unprecedented performance across various NLP tasks, we aim to assess the applicability of large language models (LLMs) to zero-shot long document evidence retrieval. However, current LLMs accept only a limited context length as input, so feeding a document in chunks can overlook the global context and fail to capture inter-segment dependencies. Moreover, feeding large inputs directly incurs significant computational cost, particularly when the entire document is processed (and potentially monetary cost with enterprise APIs such as OpenAI's GPT variants). To address these challenges, we propose a suite of techniques that exploit the discourse structure commonly found in documents. Using this structure, we create a condensed representation of the document that enables a more comprehensive understanding and analysis of the relationships between its parts. In the information-seeking evidence retrieval setup, we retain $99.6\%$ of the best zero-shot approach's performance while processing only $26\%$ of the tokens used by that approach. We also show how our approach can be combined with a \textit{self-ask} reasoning agent to achieve the best zero-shot performance in complex multi-hop question answering, just $\approx 4\%$ short of zero-shot performance using gold evidence.
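The abstract does not specify how the condensed representation is built, so the following is only a minimal sketch of one way a structure-aware, coarse-to-fine retrieval could look. It assumes a document is already split into titled sections, condenses each section to its title plus first sentence, picks the most promising sections from that outline, and only then scores the paragraphs inside them. All names here (`Section`, `condensed_outline`, `retrieve`, `overlap_score`) are hypothetical, and the lexical-overlap scorer is a placeholder for the zero-shot LLM call, not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Section:
    """A titled section of a document (hypothetical container)."""
    title: str
    paragraphs: List[str]


def tokens(text: str) -> List[str]:
    """Crude whitespace tokenizer; a stand-in for a real LLM tokenizer."""
    stripped = (t.strip(".,;:?!()").lower() for t in text.split())
    return [t for t in stripped if t]


def overlap_score(question: str, text: str) -> float:
    """Placeholder relevance scorer (lexical overlap), NOT an LLM call."""
    q = set(tokens(question))
    return len(q & set(tokens(text))) / max(len(q), 1)


def condensed_outline(doc: List[Section]) -> List[str]:
    """One short line per section: its title plus its first sentence."""
    outline = []
    for section in doc:
        first = section.paragraphs[0].split(". ")[0] if section.paragraphs else ""
        outline.append(f"{section.title}: {first}")
    return outline


def retrieve(
    question: str,
    doc: List[Section],
    scorer: Callable[[str, str], float] = overlap_score,
    top_sections: int = 1,
    top_paragraphs: int = 1,
) -> Tuple[List[str], int, int]:
    """Return (evidence paragraphs, tokens processed, tokens in full document)."""
    outline = condensed_outline(doc)

    # Coarse step: rank sections using only the condensed outline.
    ranked = sorted(
        range(len(doc)), key=lambda i: scorer(question, outline[i]), reverse=True
    )
    chosen = ranked[:top_sections]

    # Fine step: score paragraphs, but only inside the chosen sections.
    candidates = [p for i in chosen for p in doc[i].paragraphs]
    candidates.sort(key=lambda p: scorer(question, p), reverse=True)
    evidence = candidates[:top_paragraphs]

    # Tokens actually read: the outline plus the paragraphs of chosen sections.
    processed = sum(len(tokens(line)) for line in outline)
    processed += sum(len(tokens(p)) for p in candidates)
    total = sum(len(tokens(p)) for s in doc for p in s.paragraphs)
    return evidence, processed, total
```

In this sketch the scorer is a parameter so that the overlap heuristic can be swapped for a prompt to an LLM; the token counts it returns only illustrate the idea of reading fewer tokens than the full document and say nothing about the $26\%$ figure reported above.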