In Information Retrieval, and more generally in Natural Language Processing, models are usually adapted to specific domains through fine-tuning. Despite its successes and versatility, fine-tuning requires human-curated, labeled data, which makes it impractical to transfer to new tasks, domains, and/or languages when no training data exists. Using the model without training (zero-shot) is an alternative, but it comes at a cost in effectiveness, especially for first-stage retrievers. Numerous research directions have emerged to tackle these issues, most of them in the context of adapting to a task or a language; the literature on domain (or topic) adaptation is scarcer. In this paper, we address this cross-topic discrepancy for a sparse first-stage retriever by transposing a method initially designed for language adaptation. By leveraging pre-training on the target data to learn domain-specific knowledge, this technique alleviates the need for annotated data and expands the scope of domain adaptation. We show that even sparse retrievers, despite their relatively good generalization ability, can benefit from our simple domain adaptation method.