Dense retrievers have made significant strides in text retrieval and open-domain question answering, but most of these gains have depended on large amounts of human supervision. In this work, we propose two unsupervised methods that create pseudo query-document pairs and train dense retrieval models in an annotation-free and scalable manner: query extraction and transferred query generation. The former produces pseudo queries by selecting salient spans from the original document. The latter uses generation models trained for other NLP tasks (e.g., summarization) to produce pseudo queries. Extensive experiments show that models trained with the proposed augmentation methods perform comparably to, or better than, multiple strong baselines. Combining the two strategies yields further improvements, achieving state-of-the-art performance for unsupervised dense retrieval on both BEIR and ODQA datasets.
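To make the idea of query extraction concrete, the following is a minimal sketch of how a pseudo query-document pair could be built by picking a salient span from each document. The salience heuristic used here (the sentence whose words are most frequent in the document) and the names `extract_pseudo_query`, `build_pairs`, and `max_words` are illustrative assumptions, not the span-selection strategies evaluated in this work.

```python
import re
from collections import Counter


def _tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_pseudo_query(document, max_words=16):
    """Pick one sentence from the document to serve as a pseudo query.

    Assumption: salience is approximated by the average in-document
    frequency of a sentence's tokens. This is a toy heuristic chosen
    for illustration only.
    """
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    if not sentences:
        return ""

    doc_counts = Counter(_tokenize(document))

    def salience(sentence):
        tokens = _tokenize(sentence)
        if not tokens:
            return 0.0
        return sum(doc_counts[t] for t in tokens) / len(tokens)

    best = max(sentences, key=salience)
    # Truncate long sentences so the pseudo query stays query-sized.
    return " ".join(best.split()[:max_words])


def build_pairs(documents):
    """Return (pseudo_query, document) pairs for non-empty documents."""
    return [(extract_pseudo_query(d), d) for d in documents if d.strip()]


if __name__ == "__main__":
    doc = (
        "Dense retrieval maps text to vectors. Cats sleep. "
        "Dense retrieval uses dense vectors for retrieval."
    )
    for query, source in build_pairs([doc]):
        print("pseudo query:", query)
```

Pairs produced this way could then be fed to a standard contrastive training loop for a dual encoder, with the pseudo query as the anchor and its source document as the positive. Transferred query generation would replace `extract_pseudo_query` with the output of a generation model trained on another task such as summarization.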