The success of contextual word representations and advances in neural information retrieval have made dense vector-based retrieval a standard approach for passage and document ranking. While effective and efficient, dual-encoders are brittle to variations in query distributions and to noisy queries. Data augmentation can make models more robust, but it adds overhead to training set generation and requires retraining and index regeneration. We present Contrastive Alignment POst Training (CAPOT), a highly efficient finetuning method that improves model robustness without requiring index regeneration, training set optimization, or alteration. CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root. We evaluate CAPOT on noisy variants of MSMARCO, Natural Questions, and Trivia QA passage retrieval, finding that CAPOT has an impact similar to data augmentation with none of its overhead.
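The alignment objective described above can be illustrated with a small sketch. This is not the paper's implementation: the character-swap noise, the InfoNCE-style loss, the temperature value, and every function name here are assumptions chosen for illustration. The sketch only shows the shape of the idea, in which embeddings of noisy queries are scored against fixed embeddings of their unaltered roots, with the other roots in the batch serving as negatives. In a real setup only the query encoder would be updated, so the document encoder and its index would stay untouched.

```python
import math
import random


def swap_adjacent_chars(query, rng):
    """Return a noisy variant of `query` with one adjacent character pair swapped.

    Hypothetical noise function; the abstract does not specify the noise types.
    """
    if len(query) < 2:
        return query
    i = rng.randrange(len(query) - 1)
    chars = list(query)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_alignment_loss(noisy_vecs, root_vecs, temperature=0.1):
    """InfoNCE-style loss: noisy query i should be closest to root query i.

    `root_vecs` are treated as fixed targets; the other roots in the batch
    act as negatives. The temperature of 0.1 is an assumed value.
    """
    total = 0.0
    for i, noisy in enumerate(noisy_vecs):
        logits = [cosine(noisy, root) / temperature for root in root_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_denominator - logits[i]
    return total / len(noisy_vecs)


# Toy usage: embeddings are hand-written 2-d vectors standing in for encoder outputs.
rng = random.Random(0)
noisy_query = swap_adjacent_chars("what is dense retrieval", rng)

roots = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # each noisy vector is near its own root
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # each noisy vector is near the wrong root
loss_aligned = contrastive_alignment_loss(aligned, roots)
loss_misaligned = contrastive_alignment_loss(misaligned, roots)
```

In this toy example the loss is lower when each noisy embedding sits next to its own root than when it sits next to another query's root, which is the behavior a query encoder trained on such an objective would be pushed toward.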