Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval

Current dense retrievers (DRs) handle misspelled queries poorly, even though such queries account for a significant portion of query traffic in commercial search engines. The main cause is that the pre-trained language model encoders used by DRs are typically trained and fine-tuned on clean, well-curated text. Misspelled queries rarely appear in this data, so those observed at inference time are out-of-distribution relative to what the encoder saw during training and fine-tuning. Previous efforts to address this issue have focused on fine-tuning strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ a separate state-of-the-art spell-checking component. To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel pre-training strategy for DRs that increases their robustness to misspelled queries while preserving their effectiveness on downstream retrieval tasks. ToRoDer uses an encoder-decoder architecture in which the encoder takes misspelled text with masked tokens as input and passes bottlenecked information to the decoder. The decoder takes as input the bottlenecked embeddings together with the token embeddings of the original text, in which the misspelled tokens are masked out. The pre-training task is to recover the masked tokens at both the encoder and the decoder. Our extensive experiments and detailed ablation studies show that DRs pre-trained with ToRoDer are significantly more effective on misspelled queries, substantially narrowing the gap with pipelines that use a separate, complex spell-checker component, while retaining their effectiveness on correctly spelled queries.
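
The input construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the adjacent-character-swap typo generator, the typo and mask rates, and the choice of the clean token as the recovery target at every masked position are all assumptions made for illustration. The bottlenecked embedding passed from encoder to decoder is a model component and is not modelled here.

```python
import random

MASK = "[MASK]"


def inject_typo(word, rng):
    """Swap two adjacent characters (hypothetical stand-in for a typo generator)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def build_example(tokens, typo_rate=0.3, mask_rate=0.3, seed=0):
    """Build encoder and decoder inputs for one pre-training example.

    The rates are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)

    # Positions selected for typo injection.
    typo_pos = {i for i in range(len(tokens)) if rng.random() < typo_rate}
    noisy = [inject_typo(t, rng) if i in typo_pos else t
             for i, t in enumerate(tokens)]

    # Encoder side: the misspelled text with a random subset of tokens masked.
    enc_mask_pos = {i for i in range(len(tokens)) if rng.random() < mask_rate}
    encoder_input = [MASK if i in enc_mask_pos else t
                     for i, t in enumerate(noisy)]
    # Assumption: the target at a masked position is the clean token.
    encoder_targets = {i: tokens[i] for i in enc_mask_pos}

    # Decoder side: the original text with the misspelled positions masked out.
    decoder_input = [MASK if i in typo_pos else t
                     for i, t in enumerate(tokens)]
    decoder_targets = {i: tokens[i] for i in typo_pos}

    return {
        "typo_positions": sorted(typo_pos),
        "encoder_input": encoder_input,
        "encoder_targets": encoder_targets,
        "decoder_input": decoder_input,
        "decoder_targets": decoder_targets,
    }


if __name__ == "__main__":
    example = build_example("what is dense passage retrieval".split(), seed=1)
    for name, value in example.items():
        print(f"{name}: {value}")
```

In this sketch the decoder never sees a misspelled token directly, so whatever it needs to recover the masked positions has to come from the encoder's bottlenecked representation of the misspelled text, which is the intuition the abstract gives for the pre-training task.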
