We introduce and define the novel problem of multi-distribution information retrieval (IR): given a query, a system must retrieve passages from multiple collections, each drawn from a different distribution. Some of these collections and distributions might not be available at training time. To evaluate methods for multi-distribution retrieval, we design three benchmarks from existing single-distribution datasets: one based on question answering and two based on entity matching. We propose simple methods for this task that allocate the fixed retrieval budget (the top-k passages) strategically across domains, so that the known domains do not consume most of the budget. Our methods improve Recall@100 by an average of at least 3.8 points and by up to 8.0 points across the datasets, and the improvements are consistent when fine-tuning different base retrieval models. Our benchmarks are publicly available.
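To illustrate the idea of allocating a fixed top-k budget across collections, the following is a minimal sketch and not the allocation strategy evaluated in this work. It assumes each collection has already been ranked separately, gives each collection an equal per-collection quota, and refills unused slots with the best remaining passages. The names `allocate_budget`, `naive_top_k`, and the example collections and scores are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Scored = Tuple[str, float]  # (passage_id, retrieval score); illustrative type


def naive_top_k(ranked: Dict[str, List[Scored]], k: int) -> List[Scored]:
    """Baseline: pool all collections and keep the k highest-scoring passages."""
    merged = [hit for hits in ranked.values() for hit in hits]
    return sorted(merged, key=lambda h: h[1], reverse=True)[:k]


def allocate_budget(ranked: Dict[str, List[Scored]], k: int) -> List[Scored]:
    """Assumed strategy: split the budget k evenly across collections.

    Slots a collection cannot fill are given to the best leftover passages.
    """
    names = sorted(ranked)
    if not names or k <= 0:
        return []
    base, extra = divmod(k, len(names))
    # The first `extra` collections (alphabetically) receive one additional slot.
    quotas = {n: base + (1 if i < extra else 0) for i, n in enumerate(names)}

    picked: List[Scored] = []
    leftovers: List[Scored] = []
    for n in names:
        hits = sorted(ranked[n], key=lambda h: h[1], reverse=True)
        picked.extend(hits[: quotas[n]])
        leftovers.extend(hits[quotas[n]:])

    # Refill any unused slots with the highest-scoring leftovers.
    leftovers.sort(key=lambda h: h[1], reverse=True)
    picked.extend(leftovers[: k - len(picked)])
    return sorted(picked, key=lambda h: h[1], reverse=True)


# Hypothetical scores: the "known" collection systematically scores higher.
example = {
    "known": [("k1", 0.9), ("k2", 0.8), ("k3", 0.7), ("k4", 0.6)],
    "unknown": [("u1", 0.5), ("u2", 0.4)],
}
pooled = [pid for pid, _ in naive_top_k(example, 4)]
balanced = [pid for pid, _ in allocate_budget(example, 4)]
```

In this toy example, pooling all passages lets the known collection take every slot, whereas the even split reserves half of the budget for the unknown collection. An even split is only one possible choice; the quotas could equally be weighted by any other assumed prior over collections.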