Leveraging LLMs for Unsupervised Dense Retriever Ranking

In this paper we present Large Language Model Assisted Retrieval Model Ranking (LARMOR), an effective unsupervised approach that leverages LLMs to select which dense retriever to use on a test (target) corpus. Dense retriever selection is crucial for many IR applications that rely on dense retrievers trained on public corpora to encode or search a new, private target corpus. This is because a dense retriever's performance often drops under domain shift, that is, when the corpus, domain, or task of the target differs from the one the retriever was trained on. Furthermore, when the target corpus is unlabeled, e.g., in a zero-shot scenario, direct evaluation of the model on the target corpus is infeasible. Unsupervised selection of the most effective pre-trained dense retriever therefore becomes a crucial challenge. Current methods for dense retriever selection fall short in scenarios with domain shift. Our proposed solution leverages LLMs to generate pseudo-relevant queries, labels, and reference lists from a set of documents sampled from the target corpus. Dense retrievers are then ranked by their effectiveness on these generated pseudo-relevance signals. Notably, our method is the first approach that relies solely on the target corpus, eliminating the need for both training corpora and test labels. To evaluate the effectiveness of our method, we construct a large pool of state-of-the-art dense retrievers. The proposed approach outperforms existing baselines with respect to both dense retriever selection and ranking. We make our code and results publicly available at https://github.com/ielab/larmor/.
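The ranking step described above can be illustrated with a minimal sketch. This is not the released LARMOR implementation: it assumes the LLM has already produced pseudo-queries and pseudo-relevance labels (hand-written toy data stands in for them here), it assumes nDCG@10 as the effectiveness measure, and it omits the reference-list signal. All names (`ndcg_at_k`, `rank_retrievers`, `retriever_a`, `retriever_b`) are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a retriever maps a query string to a ranked list of doc ids.
Retriever = Callable[[str], List[str]]


def ndcg_at_k(ranked: List[str], rels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k of one ranked list against graded (pseudo-)relevance labels."""
    dcg = sum(rels.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(ranked[:k]))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def rank_retrievers(
    retrievers: Dict[str, Retriever],
    pseudo_qrels: Dict[str, Dict[str, int]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Order candidate retrievers by mean nDCG@k over the pseudo-queries."""
    scores = {}
    for name, retrieve in retrievers.items():
        per_query = [ndcg_at_k(retrieve(q), rels, k) for q, rels in pseudo_qrels.items()]
        scores[name] = sum(per_query) / len(per_query) if per_query else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy stand-in for LLM-generated pseudo-queries and pseudo-labels (assumption).
pseudo_qrels = {"q1": {"d1": 1}, "q2": {"d2": 1}}

# Toy stand-ins for dense retrievers: fixed rankings per query (assumption).
runs_a = {"q1": ["d1", "d3"], "q2": ["d2", "d1"]}
runs_b = {"q1": ["d3", "d1"], "q2": ["d1", "d3"]}
retrievers = {
    "retriever_a": lambda q: runs_a[q],
    "retriever_b": lambda q: runs_b[q],
}

ranking = rank_retrievers(retrievers, pseudo_qrels)
print(ranking)  # retriever_a is listed first on this toy data
```

In practice each toy retriever would be replaced by a real dense retriever searching an index of the target corpus, and the pseudo-labels by LLM judgments over documents sampled from that corpus.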
