In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution differs from that of the source domain. Existing methods focus either on unsupervised domain adaptation, where the model has access to the target document collection, or on supervised (often few-shot) domain adaptation, where it additionally has access to (limited) labeled data in the target domain. There is also research on improving the zero-shot performance of retrieval models with no adaptation. This paper introduces a new, as-yet unexplored category of domain adaptation in IR. As in the zero-shot setting, we assume the retrieval model does not have access to the target document collection. Unlike the zero-shot setting, however, it does have access to a brief textual description that explains the target domain. We define a taxonomy of domain attributes in retrieval tasks to understand the properties of a source domain that can be adapted to a target domain. We introduce a novel automatic data construction pipeline that, given a textual domain description, produces a synthetic document collection, a query set, and pseudo relevance labels. Extensive experiments on five diverse target domains show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.