Understanding temporal relations and answering time-sensitive questions is a crucial yet challenging task for question-answering systems powered by large language models (LLMs). Existing approaches either update the parametric knowledge of LLMs with new facts, which is resource-intensive and often impractical, or integrate LLMs with external knowledge retrieval (i.e., retrieval-augmented generation). However, off-the-shelf retrievers often struggle to identify relevant documents for questions that require intensive temporal reasoning. To systematically study time-sensitive question answering, we introduce the TempRAGEval benchmark, which repurposes existing datasets by incorporating temporal perturbations and gold evidence labels. As anticipated, all existing retrieval methods struggle with these temporal reasoning-intensive questions. We further propose Modular Retrieval (MRAG), a training-free framework that includes three modules: (1) Question Processing, which decomposes a question into its main content and a temporal constraint; (2) Retrieval and Summarization, which retrieves evidence and uses LLMs to summarize it according to the main content; and (3) Semantic-Temporal Hybrid Ranking, which scores each evidence summary based on both semantic and temporal relevance. On TempRAGEval, MRAG significantly outperforms baseline retrievers in retrieval performance, leading to further improvements in final answer accuracy.
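To make the three-module flow concrete, the following is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not the MRAG implementation: the regex-based question splitter, the Jaccard token overlap used as a stand-in for semantic relevance, the binary year check used as a stand-in for temporal relevance, the mixing weight `alpha`, and all function names (`process_question`, `semantic_score`, `temporal_score`, `mrag_rank`) are hypothetical. Retrieval is assumed to have already produced candidate passages, and the LLM summarizer is passed in as a plain callable.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical pattern: a relation word followed by a four-digit year.
_TEMPORAL = re.compile(r"\b(as of|in|before|after)\s+(\d{4})\b", re.IGNORECASE)
_YEAR = re.compile(r"\b(\d{4})\b")


@dataclass
class ProcessedQuestion:
    main_content: str
    relation: Optional[str]  # "as of", "in", "before", "after", or None
    year: Optional[int]


def process_question(question: str) -> ProcessedQuestion:
    """Module 1 (toy): split a question into main content and a temporal constraint."""
    match = _TEMPORAL.search(question)
    if match is None:
        return ProcessedQuestion(question.strip(), None, None)
    main = question[:match.start()] + question[match.end():]
    main = re.sub(r"\s+", " ", main).strip()
    main = re.sub(r"\s+([?.!,])", r"\1", main)  # drop the space left before punctuation
    return ProcessedQuestion(main, match.group(1).lower(), int(match.group(2)))


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def semantic_score(main_content: str, summary: str) -> float:
    """Assumed stand-in for a semantic relevance model: Jaccard token overlap."""
    a, b = _tokens(main_content), _tokens(summary)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def temporal_score(pq: ProcessedQuestion, summary: str) -> float:
    """Assumed stand-in for temporal relevance: 1.0 if any year in the summary
    satisfies the constraint, else 0.0."""
    if pq.year is None or pq.relation is None:
        return 0.0
    target = pq.year
    checks = {
        "in": lambda y: y == target,
        "as of": lambda y: y <= target,
        "before": lambda y: y < target,
        "after": lambda y: y > target,
    }
    satisfied = checks[pq.relation]
    return 1.0 if any(satisfied(int(y)) for y in _YEAR.findall(summary)) else 0.0


def mrag_rank(
    question: str,
    passages: List[str],
    summarize: Callable[[str, str], str],
    alpha: float = 0.5,  # hypothetical weight between semantic and temporal scores
) -> List[Tuple[float, str]]:
    """Modules 2 and 3 (toy): summarize each candidate, then rank by a hybrid score."""
    pq = process_question(question)
    scored = []
    for passage in passages:
        summary = summarize(pq.main_content, passage)  # an LLM call in practice
        score = (alpha * semantic_score(pq.main_content, summary)
                 + (1 - alpha) * temporal_score(pq, summary))
        scored.append((score, summary))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    candidates = [
        "Alice became CEO of Acme in 2017.",
        "Bob became CEO of Acme in 2021.",
    ]
    # Identity "summarizer" used only so the sketch runs without an LLM.
    for score, summary in mrag_rank(
        "Who was the CEO of Acme as of 2019?", candidates, lambda main, p: p
    ):
        print(round(score, 2), summary)
```

In this toy setup both candidates overlap equally with the main content, so the temporal term alone decides the order, which is the situation the hybrid ranking is meant to handle. A real system would replace the regex splitter, the overlap score, and the binary year check with learned or LLM-based components.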