Pre-trained Transformers (\eg BERT) are commonly used to initialize the parameters of dense retrieval models, and recent studies explore more effective pre-training tasks to further improve the quality of dense vectors. Although various novel and effective tasks have been proposed, their differing input formats and learning objectives make them hard to integrate for jointly improving model performance. In this work, we unify a variety of pre-training tasks under the bottlenecked masked autoencoder framework and integrate them into a multi-task pre-trained model, named MASTER. Concretely, MASTER uses a shared-encoder multi-decoder architecture that constructs a representation bottleneck to compress the abundant semantic information across tasks into dense vectors. On top of this architecture, we integrate three types of representative pre-training tasks: corrupted passage recovery, related passage recovery, and pre-trained language model (PLM) output recovery, which capture inner-passage information, inter-passage relations, and PLM knowledge, respectively. Extensive experiments show that our approach outperforms competitive dense retrieval methods. Our code and data are publicly available at \url{https://github.com/microsoft/SimXNS}.
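As a rough illustration of the shared-encoder multi-decoder idea (a sketch in notation introduced here, not necessarily the exact objective used by MASTER), assume the encoder maps a masked passage $\tilde{x}$ to a single dense vector $\mathbf{h}$, and each of the $K$ task-specific decoders $\mathrm{Dec}_k$ must recover its target text $y^{(k)}$ from $\mathbf{h}$ together with a heavily masked copy $\tilde{y}^{(k)}$. The symbols $\mathrm{Enc}$, $\mathrm{Dec}_k$, $M_k$, and the unweighted sum over tasks are all assumptions made for illustration.

\[
\mathbf{h} = \mathrm{Enc}(\tilde{x}), \qquad
\mathcal{L}_k = -\sum_{i \in M_k} \log p_{\mathrm{Dec}_k}\bigl(y^{(k)}_i \mid \mathbf{h}, \tilde{y}^{(k)}\bigr), \qquad
\mathcal{L} = \sum_{k=1}^{K} \mathcal{L}_k
\]

Here $M_k$ denotes the set of masked positions in $\tilde{y}^{(k)}$. Under these assumptions, each decoder sees only a few unmasked tokens, so it can lower $\mathcal{L}_k$ mainly by relying on $\mathbf{h}$; this is what would make the dense vector act as an information bottleneck shared by all tasks.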