Domain transfer is a prevalent challenge in modern neural Information Retrieval (IR). To overcome this problem, previous research has used domain-specific manual annotations and synthetic data produced by consistency filtering to fine-tune a general ranker into a domain-specific ranker. However, training such consistency filters is computationally expensive, which significantly reduces model efficiency. In addition, consistency filtering often struggles to identify retrieval intentions and to recognize the query and corpus distributions of a target domain. In this study, we evaluate a more efficient solution: replacing the consistency filter with direct pseudo-labeling, pseudo-relevance feedback, or unsupervised keyword generation to achieve consistency-filtering-free unsupervised dense retrieval. Our extensive experimental evaluations demonstrate that, on average, TextRank-based pseudo-relevance feedback outperforms the other methods. Furthermore, we analyze the training and inference efficiency of the proposed paradigm. The results indicate that filtering-free unsupervised learning can consistently improve training and inference efficiency while maintaining retrieval performance. On some datasets, it even improves retrieval performance.
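To make the keyword-based idea concrete, the following is a minimal sketch of TextRank keyword extraction used to build a pseudo-query for a document. It is not the authors' implementation: the abstract does not specify the pipeline, so the function names (`tokenize`, `textrank_keywords`, `pseudo_query`), the tiny stopword list, the co-occurrence window, and the damping factor are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real systems use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for",
             "on", "with", "that", "by", "are", "as"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def textrank_keywords(text, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank words with a PageRank-style update on a co-occurrence graph."""
    tokens = tokenize(text)

    # Undirected graph: two distinct words are linked if they occur
    # within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != tok:
                neighbors[tok].add(tokens[j])
                neighbors[tokens[j]].add(tok)

    if not neighbors:
        return []

    # Fixed number of update rounds instead of a convergence check,
    # to keep the sketch short.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * rank
        scores = updated

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def pseudo_query(document, top_k=3):
    """Join the top keywords into a pseudo-query paired with the document."""
    return " ".join(textrank_keywords(document, top_k=top_k))
```

Under these assumptions, each (pseudo-query, document) pair could serve directly as a positive training example for a dense retriever, with no separately trained filter in between. How the keywords are actually combined with retrieval feedback in the evaluated methods is not stated in the abstract.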