We introduce ART, a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. ART, in contrast, only requires access to unpaired inputs and outputs (e.g., questions and potential answer documents). It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question. Training for retrieval based on question reconstruction enables effective unsupervised learning of both document and question encoders, which can later be incorporated into complete Open QA systems without any further finetuning. Extensive experiments demonstrate that ART obtains state-of-the-art results on multiple QA retrieval benchmarks with only generic initialization from a pre-trained language model, removing the need for labeled data and task-specific losses.
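The two-step scheme above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the training objective, so the KL-divergence loss below is an assumption, and all names (`retrieve_top_k`, `reconstruction_loss`, `toy_log_probs`) are hypothetical. The reconstruction log-probabilities, which in practice would come from a pre-trained language model scoring the question given each document, are passed in as plain numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(question_vec, doc_vecs, k):
    """Step (1): rank documents by inner product with the question embedding."""
    scores = [dot(question_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    return top, [scores[i] for i in top]


def reconstruction_loss(retriever_scores, recon_log_probs):
    """Step (2): compare the retriever's distribution over the retrieved set
    with a target distribution derived from question reconstruction.

    recon_log_probs[i] stands in for log p(question | document_i).
    ASSUMPTION: the loss is KL(target || retriever); the abstract does not
    state the exact objective.
    """
    p_retriever = softmax(retriever_scores)
    p_target = softmax(recon_log_probs)
    return sum(
        t * (math.log(t) - math.log(r))
        for t, r in zip(p_target, p_retriever)
        if t > 0.0
    )


# Toy data: 2-d embeddings for one question and four documents (made up).
question_vec = [1.0, 0.0]
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]

top_ids, top_scores = retrieve_top_k(question_vec, doc_vecs, k=2)

# Hypothetical reconstruction log-probabilities for the two retrieved documents.
toy_log_probs = [-2.0, -5.0]
loss = reconstruction_loss(top_scores, toy_log_probs)
print(top_ids, loss)
```

In a real system the loss would be minimized with respect to the encoder parameters that produce `question_vec` and `doc_vecs`, so that documents from which the question is easier to reconstruct receive higher retrieval scores. No labeled question-document pairs enter the computation.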