Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulty when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that enhances dense retrievers by selecting high-quality, easy-to-learn training data. Our approach is built around three core components: synthetic data generation, data selection with round-trip prediction to identify easy-to-learn instances, and retriever training on these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestions, demonstrating substantial improvements over existing retrievers such as BM25 and Contriever, especially on extremely long-tail categories. We identify the strengths and limitations of RPDR through detailed human analysis and propose a routing mechanism that dynamically directs queries to specialized retrieval modules to further improve retrieval performance.
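The round-trip selection step described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: it assumes a synthetic set of (question, source passage) pairs and keeps only those whose source passage is ranked top-k for its own question, standing in for "easy-to-learn" instances. The bag-of-words cosine scorer is a hypothetical stand-in for a dense retriever, and all function names are our own.

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Toy bag-of-words representation (stand-in for a dense encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def round_trip_select(pairs, corpus, top_k=1):
    """Keep synthetic (question, source) pairs whose source passage is
    retrieved in the top-k for its own question ("easy to learn")."""
    kept = []
    for question, source in pairs:
        ranked = sorted(corpus,
                        key=lambda p: cosine(bow(question), bow(p)),
                        reverse=True)
        if source in ranked[:top_k]:
            kept.append((question, source))
    return kept

# Hypothetical corpus and synthetic pairs for illustration only.
corpus = [
    "Marie Curie won two Nobel Prizes in physics and chemistry.",
    "The Eiffel Tower is located in Paris, France.",
]
synthetic = [
    ("Who won two Nobel Prizes?", corpus[0]),
    ("Where is the Eiffel Tower?", corpus[1]),
    ("What year was it built?", corpus[1]),  # underspecified: filtered out
]
easy = round_trip_select(synthetic, corpus)
```

The filtered set would then serve as training data for the retriever; the underspecified third question fails the round trip because its source passage is not retrieved, which is exactly the kind of hard-to-learn instance the selection step discards.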