Information retrieval (IR) aims to find information in a corpus that meets users' needs. Different needs correspond to different IR tasks, such as document retrieval, open-domain question answering, and retrieval-based dialogue, yet all of these tasks share the same schema: estimating the relationship between texts. This suggests that a good IR model should be able to generalize across tasks and domains. However, previous studies indicate that state-of-the-art neural information retrieval (NIR) models, e.g., pre-trained language models (PLMs), generalize poorly, mainly because the end-to-end fine-tuning paradigm makes the model overemphasize task-specific signals and domain biases and lose the ability to capture generalized essential signals. To address this problem, we propose NIR-Prompt, a novel NIR training framework for the retrieval and reranking stages, based on the idea of decoupling signal capturing from signal combination. NIR-Prompt uses an Essential Matching Module (EMM) to capture the essential matching signals and a Matching Description Module (MDM) to obtain a description of each task. The description serves as task-adaptation information for combining the essential matching signals so that the model adapts to different tasks. Experiments under in-domain multi-task, out-of-domain multi-task, and new-task adaptation settings show that NIR-Prompt improves the generalization of PLMs in NIR for both the retrieval and reranking stages compared with baselines.