The dual-encoder has become the de facto architecture for dense retrieval. Typically, it computes the latent representations of the query and the document independently, and thus fails to fully capture the interactions between them. To alleviate this, recent research has focused on obtaining query-informed document representations. During training, these methods expand the document with a real query, but during inference they replace the real query with a generated one. This inconsistency between training and inference causes the dense retrieval model to prioritize query information while disregarding the document when computing the document representation. Consequently, such a model performs even worse than the vanilla dense retrieval model, because its performance relies heavily on the relevance between the generated queries and the real query. In this paper, we propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query. By doing so, the retrieval model learns to extend its attention from the document alone to both the document and the query, resulting in high-quality query-informed document representations. Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
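The curriculum idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how relevance is measured, how the schedule advances, or how the query is attached to the document, so everything below is an assumption made for illustration: the function names `curriculum_sample` and `expand_document`, the bucketed schedule, the `progress` parameter, and the `[SEP]` separator are all hypothetical and are not taken from the paper.

```python
import random
from typing import Optional, Sequence, Tuple


def curriculum_sample(
    scored_queries: Sequence[Tuple[str, float]],
    progress: float,
    num_buckets: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one pseudo query whose relevance tier matches training progress.

    scored_queries: (pseudo_query, relevance_to_real_query) pairs; how the
        relevance score is obtained is outside this sketch (assumption).
    progress: fraction of training completed, in [0, 1].
    num_buckets: number of relevance tiers in the hypothetical schedule.
    """
    if not scored_queries:
        raise ValueError("need at least one pseudo query")
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    rng = rng or random.Random()

    # Sort from least to most relevant to the real query.
    ranked = sorted(scored_queries, key=lambda pair: pair[1])

    # Never use more buckets than queries, so that no bucket is empty.
    num_buckets = max(1, min(num_buckets, len(ranked)))

    # Early in training use the low-relevance bucket, later the high one.
    bucket = min(int(progress * num_buckets), num_buckets - 1)
    start = bucket * len(ranked) // num_buckets
    end = (bucket + 1) * len(ranked) // num_buckets
    return rng.choice(ranked[start:end])[0]


def expand_document(document: str, query: str, sep: str = " [SEP] ") -> str:
    """Attach a query to the document text (prepending is an assumption)."""
    return query + sep + document


if __name__ == "__main__":
    pseudo = [
        ("history of rome", 0.1),
        ("what is dense retrieval", 0.5),
        ("dual encoder dense retrieval", 0.9),
    ]
    for step_fraction in (0.0, 0.5, 1.0):
        query = curriculum_sample(pseudo, step_fraction)
        print(step_fraction, expand_document("Dual encoders embed ...", query))
```

Under these assumptions, early training steps pair the document with weakly related pseudo queries, so the encoder has to rely on the document itself, while later steps supply pseudo queries closer to the real query. A continuous schedule or a weighted sampling scheme would serve the same purpose; the bucketed form is chosen here only because it is easy to read.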