Selecting which Dense Retriever to use for Zero-Shot Search

We propose the new problem of choosing which dense retrieval model to use when searching a new collection for which no labels are available, i.e. in a zero-shot setting. Many dense retrieval models are readily available. However, their search effectiveness differs widely, not just on the test portion of the datasets on which the dense representations were learned but, importantly, also across datasets whose data was not used to learn the representations. This is because dense retrievers typically require training on a large amount of labeled data to achieve satisfactory search effectiveness in a specific dataset or domain. Moreover, effectiveness gains that dense retrievers obtain on datasets whose labels they observe during training do not necessarily generalise to datasets not observed during training. Selecting a model without labels is, however, a hard problem: through empirical experimentation we show that methods inspired by recent work on unsupervised performance evaluation in the presence of domain shift, from computer vision and machine learning, are not effective for choosing highly performing dense retrievers in our setup. Reliable methods for selecting dense retrieval models in zero-shot settings, without requiring the collection of labels for evaluation, would streamline the widespread adoption of dense retrieval. We therefore believe this is an important new problem that the information retrieval community should consider. Implementations of the methods, along with raw result files and analysis scripts, are made publicly available at https://www.github.com/anonymized.
