Recent studies have explored dense passage retrieval with pre-trained language models, among which the masked auto-encoder (MAE) pre-training architecture has emerged as the most promising. The conventional MAE framework uses the decoder's passage reconstruction to strengthen the encoder's text representation ability, thereby improving the performance of the resulting dense retrieval systems. Since the encoder's representation ability is built through the decoder's passage reconstruction, it is reasonable to postulate that a "more demanding" decoder will require a correspondingly more capable encoder. To this end, we propose a novel token-importance-aware masking strategy based on pointwise mutual information (PMI) that makes the decoder's reconstruction task more challenging. Importantly, our approach is unsupervised and adds no extra cost to the pre-training phase. Our experiments verify that the proposed method is both effective and robust on large-scale supervised passage retrieval datasets and on out-of-domain zero-shot retrieval benchmarks.
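The abstract does not spell out how token importance is derived from pointwise mutual information, so the following is only a minimal sketch of one way such a masking strategy could look, not the authors' implementation. It assumes passage-level co-occurrence counts, defines a token's importance as its mean PMI with the other tokens of the same passage, and masks the highest-scoring positions. The function names (`pmi_table`, `token_importance`, `importance_mask`), the `ratio` parameter, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(passages):
    """Passage-level PMI for every token pair that co-occurs at least once.

    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), with probabilities estimated
    as the fraction of passages containing the token (or both tokens).
    """
    n = len(passages)
    tok_df, pair_df = Counter(), Counter()
    for passage in passages:
        uniq = sorted(set(passage))
        tok_df.update(uniq)
        pair_df.update(combinations(uniq, 2))  # keys are (a, b) with a < b
    return {
        (a, b): math.log(c * n / (tok_df[a] * tok_df[b]))
        for (a, b), c in pair_df.items()
    }


def token_importance(passage, pmi):
    """Assumed importance score: mean PMI of a token with the rest of its passage."""
    uniq = sorted(set(passage))
    scores = {}
    for t in uniq:
        vals = [pmi.get((min(t, u), max(t, u)), 0.0) for u in uniq if u != t]
        scores[t] = sum(vals) / len(vals) if vals else 0.0
    return scores


def importance_mask(passage, pmi, ratio=0.3, mask_token="[MASK]"):
    """Mask the `ratio` fraction of positions with the highest importance."""
    scores = token_importance(passage, pmi)
    k = max(1, round(ratio * len(passage)))
    # Highest score first; ties are broken by position for determinism.
    ranked = sorted(range(len(passage)), key=lambda i: (-scores[passage[i]], i))
    chosen = set(ranked[:k])
    return [mask_token if i in chosen else tok for i, tok in enumerate(passage)]


# Toy corpus (hypothetical): "the" appears everywhere, so its PMI with any token is 0.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "purred"],
    ["the", "dog", "barked"],
]
pmi = pmi_table(corpus)
print(importance_mask(["the", "cat", "purred"], pmi, ratio=0.67))
# ['the', '[MASK]', '[MASK]']
```

In this toy setting the ubiquitous token "the" is left visible while the informative tokens are masked, which illustrates the intuition that importance-aware masking makes reconstruction harder than uniform random masking. The statistics need only raw text, consistent with the unsupervised setting described above.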