In Information Retrieval, and more generally in Natural Language Processing, models are typically adapted to specific domains through fine-tuning. Despite the success and versatility of this method, its reliance on human-curated, labeled data makes it impractical to transfer to new tasks, domains, or languages when no training data exists. Using the model without training (zero-shot) is another option, but it comes at a cost in effectiveness, especially for first-stage retrievers. Numerous research directions have emerged to tackle these issues, most of them in the context of adapting to a task or a language. However, the literature is scarcer for domain (or topic) adaptation. In this paper, we address this issue of cross-topic discrepancy for a sparse first-stage retriever by transposing a method initially designed for language adaptation. By leveraging pre-training on the target data to learn domain-specific knowledge, this technique alleviates the need for annotated data and expands the scope of domain adaptation. We show that even sparse retrievers, despite their relatively good generalization ability, can benefit from our simple domain adaptation method.