Dual encoders are now the dominant architecture for dense retrieval. Yet we have little understanding of how they represent text, or of why these representations lead to good performance. In this work, we shed light on this question via distributions over the vocabulary. We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space. We show that the resulting projections contain rich semantic information, and we draw connections between them and sparse retrieval. We find that this view can explain some of the failure cases of dense retrievers. For example, we observe that the inability of models to handle tail entities is correlated with a tendency of the token distributions to forget some of the tokens of those entities. We leverage this insight to propose a simple way to enrich query and passage representations with lexical information at inference time, and show that this significantly improves performance over the original model in zero-shot settings, specifically on the BEIR benchmark.
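To make the idea of a vocabulary projection concrete, the following is a minimal sketch under stated assumptions, not the procedure used in this work. It assumes a dense vector can be scored against one embedding per vocabulary token and normalized with a softmax. The names `project_to_vocab` and `top_tokens`, the five-word vocabulary, and the one-hot embedding matrix are all hypothetical and chosen only for illustration.

```python
import math

def project_to_vocab(vec, emb):
    """Turn a dense vector into a distribution over the vocabulary.

    Assumption: `emb` holds one embedding row per vocabulary token, and the
    projection is a dot product followed by a softmax.
    """
    logits = [sum(w * x for w, x in zip(row, vec)) for row in emb]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_tokens(probs, vocab, k=2):
    """Return the k tokens with the highest projected probability."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    return [vocab[i] for i in order[:k]]

# Hypothetical toy vocabulary with one-hot token embeddings.
vocab = ["paris", "france", "capital", "banana", "river"]
emb = [[1.0 if i == j else 0.0 for j in range(len(vocab))]
       for i in range(len(vocab))]

# Toy "query" vector that leans mostly on "paris" and partly on "capital".
query_vec = [2.0, 0.0, 1.0, 0.0, 0.0]

probs = project_to_vocab(query_vec, emb)
print(top_tokens(probs, vocab, k=2))  # ['paris', 'capital']
```

Under this toy setup, the tokens that dominate the distribution can be read as the lexical content the vector retains; a token that belongs to the input but receives low probability would correspond to the "forgotten" tokens described above.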