Few-shot dense retrieval (DR) aims to generalize to novel search scenarios by learning from only a few samples. Despite its importance, specialized datasets and standardized evaluation protocols have received little study. As a result, current methods often resort to random sampling from supervised datasets to create "few-data" setups and employ inconsistent training strategies during evaluation, which makes it difficult to compare recent progress accurately. In this paper, we propose a customized FewDR dataset and a unified evaluation benchmark. Specifically, FewDR employs class-wise sampling to establish a standardized "few-shot" setting with finely defined classes, reducing variability across multiple sampling rounds. Moreover, the dataset is split into disjoint base and novel classes, allowing DR models to be continuously trained on ample data from base classes and a few samples from novel classes. This benchmark eliminates the risk of novel class leakage, providing a reliable estimate of a DR model's few-shot ability. Our extensive empirical results reveal that current state-of-the-art DR models still face challenges in the standard few-shot setting. Our code and data will be open-sourced at https://github.com/OpenMatch/ANCE-Tele.
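The class-wise sampling and the disjoint base/novel split described above can be illustrated with a minimal sketch. This is not the FewDR construction code: the function name `split_base_novel`, the parameters `novel_fraction` and `k_shot`, and the `(query, class_label)` input format are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_base_novel(examples, novel_fraction=0.2, k_shot=5, seed=0):
    """Split (query, class_label) pairs into base and novel sets.

    Hypothetical helper, assumed for illustration only:
    - classes are partitioned into disjoint base and novel groups;
    - every base-class example is kept for training;
    - each novel class contributes exactly k_shot training examples
      (class-wise sampling), and its remaining examples are held out.
    """
    rng = random.Random(seed)

    # Group queries by class label.
    by_class = defaultdict(list)
    for query, label in examples:
        by_class[label].append(query)

    # Partition the class labels, not the individual examples, so that
    # no novel class can leak into the base training data.
    classes = sorted(by_class)
    rng.shuffle(classes)
    n_novel = max(1, int(len(classes) * novel_fraction))
    novel_classes = classes[:n_novel]
    base_classes = classes[n_novel:]

    base_train = [(q, c) for c in sorted(base_classes) for q in by_class[c]]

    novel_train, novel_test = [], []
    for c in sorted(novel_classes):
        queries = list(by_class[c])
        rng.shuffle(queries)
        # Sample per class, so every novel class gets the same number of shots.
        novel_train.extend((q, c) for q in queries[:k_shot])
        novel_test.extend((q, c) for q in queries[k_shot:])

    return base_train, novel_train, novel_test
```

Sampling per class fixes the number of shots each novel class receives, whereas random sampling over the whole pool can leave some classes with no examples at all. That is one plausible reading of why class-wise sampling reduces variability across sampling rounds.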