Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval

Current dense retrievers (DRs) are limited in their ability to process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoders used by DRs are typically trained and fine-tuned on clean, well-curated text. Misspelled queries rarely appear in this data, so those observed at inference time are out-of-distribution with respect to the data used for training and fine-tuning. Previous efforts to address this issue have focused on fine-tuning strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ separate state-of-the-art spell-checking components. To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel pre-training strategy for DRs that increases their robustness to misspelled queries while preserving their effectiveness in downstream retrieval tasks. ToRoDer uses an encoder-decoder architecture in which the encoder takes misspelled text with masked tokens as input and outputs bottlenecked information to the decoder. The decoder then takes as input the bottlenecked embeddings, along with the token embeddings of the original text with the misspelled tokens masked out. The pre-training task is to recover the masked tokens on both the encoder side and the decoder side. Our extensive experimental results and detailed ablation studies show that DRs pre-trained with ToRoDer exhibit significantly higher effectiveness on misspelled queries, substantially narrowing the gap with pipelines that use a separate, complex spell-checker component, while retaining their effectiveness on correctly spelled queries.
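To make the input construction described above more concrete, the following is a minimal sketch of how one pre-training example might be assembled. It is not the authors' implementation. The adjacent-character swap used as the typo type, the `typo_rate` and `mask_rate` values, the whitespace-level tokens, and the function names `inject_typo` and `build_pretraining_example` are all assumptions made for illustration. The models themselves and the bottlenecked embedding passed from encoder to decoder are omitted.

```python
import random

MASK = "[MASK]"


def inject_typo(word, rng):
    """Return a misspelled variant of `word` by swapping two adjacent characters.

    Adjacent-character swap is only one illustrative typo type (an assumption).
    Words with no pair of differing adjacent characters are returned unchanged.
    """
    positions = [i for i in range(len(word) - 1) if word[i] != word[i + 1]]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def build_pretraining_example(tokens, typo_rate=0.3, mask_rate=0.3, seed=0):
    """Build encoder and decoder inputs for one text (hypothetical helper).

    Encoder input: the misspelled text with a random subset of tokens masked.
    Decoder input: the original text with the misspelled tokens masked out.
    Labels hold the original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    encoder_input, encoder_labels = [], []
    decoder_input, decoder_labels = [], []
    for tok in tokens:
        # Decide whether this token receives a typo (assumed rate).
        misspelled = inject_typo(tok, rng) if rng.random() < typo_rate else tok
        has_typo = misspelled != tok

        # Encoder side: misspelled text, randomly masked (assumed rate).
        if rng.random() < mask_rate:
            encoder_input.append(MASK)
            encoder_labels.append(tok)
        else:
            encoder_input.append(misspelled)
            encoder_labels.append(None)

        # Decoder side: original text, masked exactly where a typo was injected.
        if has_typo:
            decoder_input.append(MASK)
            decoder_labels.append(tok)
        else:
            decoder_input.append(tok)
            decoder_labels.append(None)

    return {
        "encoder_input": encoder_input,
        "encoder_labels": encoder_labels,
        "decoder_input": decoder_input,
        "decoder_labels": decoder_labels,
    }


if __name__ == "__main__":
    example = build_pretraining_example("dense retrievers handle clean queries".split())
    for name, values in example.items():
        print(name, values)
```

In this sketch the decoder can only recover a masked token from the bottlenecked encoder output, since the token is hidden in its own input. That is the intuition behind forcing the encoder to pack information about the misspelled text into its embedding.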
