Although existing neural retrieval models achieve promising results when training data is abundant, and their performance keeps improving as training data increases, collecting high-quality annotated data is prohibitively costly. To address this, we introduce a novel noisy self-training framework combined with synthetic queries, showing that neural retrievers can be improved in a self-evolving manner without relying on any external models. Experimental results show that our method consistently improves over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks. Further analysis in low-resource settings reveals that our method is data-efficient and outperforms competitive baselines with as little as 30% of the labelled training data. Extending the framework to reranker training further demonstrates that the proposed method is general and yields additional gains on tasks from diverse domains.\footnote{Source code is available at \url{https://github.com/Fantabulous-J/Self-Training-DPR}}