Masked auto-encoder (MAE) pre-training has become a prevalent technique for initializing and enhancing dense retrieval systems. It generally uses additional Transformer decoder blocks to provide sustained supervision signals and to compress contextual information into dense representations. However, the reasons why this pre-training technique is effective remain unclear, and the additional Transformer-based decoders incur significant computational costs. In this study, we shed light on this question by showing that MAE pre-training with enhanced decoding significantly improves the term coverage of input tokens in dense representations, compared to vanilla BERT checkpoints. Building on this observation, we propose a modification to the traditional MAE: we replace its decoder with a much simpler Bag-of-Word prediction task. This modification enables lexical signals to be compressed efficiently into dense representations through unsupervised pre-training. Our method achieves state-of-the-art retrieval performance on several large-scale retrieval benchmarks without requiring any additional parameters, and it provides a 67% training speed-up compared to standard MAE pre-training with enhanced decoding.
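To make the Bag-of-Word prediction idea concrete, the following is a minimal sketch of one plausible form of such an objective. It is an illustration under stated assumptions, not the implementation used in this work: the function name `bow_prediction_loss`, the `vocab_proj` matrix (assumed here to be an existing vocabulary projection, such as a shared embedding matrix, so that no new parameters are introduced), and the exact loss form (mean negative log-likelihood of the unique input tokens under a softmax over the vocabulary) are all hypothetical choices.

```python
import math


def bow_prediction_loss(dense_rep, vocab_proj, input_token_ids):
    """Hypothetical Bag-of-Word loss: mean negative log-probability of the
    unique input tokens, predicted from a single dense representation.

    dense_rep:       list of floats, the dense vector of the input text
    vocab_proj:      list of rows (one per vocabulary entry), assumed to be
                     an already existing projection, not a new parameter
    input_token_ids: token ids that appear in the input text
    """
    bag = set(input_token_ids)  # order and repeats are discarded
    if not bag:
        raise ValueError("input_token_ids must not be empty")

    # One logit per vocabulary entry: dot product with the dense vector.
    logits = [sum(w * h for w, h in zip(row, dense_rep)) for row in vocab_proj]

    # Numerically stable log of the softmax normalizer.
    max_logit = max(logits)
    log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))

    # Average negative log-likelihood over the tokens in the bag.
    return -sum(logits[t] - log_z for t in bag) / len(bag)
```

In this sketch, an all-zero dense vector yields a uniform distribution over the vocabulary, so the loss equals the log of the vocabulary size. A vector that scores the input tokens above the rest of the vocabulary yields a lower loss, which is the sense in which the objective pushes lexical information into the dense representation.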