We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations that can be trivially decomposed into retrieval-ready sparse features. When trained on frozen retrievers, Sparse Autoencoders without any retrieval-specific adjustments extract a latent vocabulary with approximately Zipfian collection statistics, directly suitable for classical sparse retrieval scoring via BM25. This approach enables sparse retrieval with no learned expansion objective and no sparse retrieval supervision, and can be readily applied to any dense retriever. Latent Terms matches or outperforms the single-vector scoring of its own base model as well as comparable SPLADE variants. In addition, it substantially outperforms its base model on LIMIT, a task specifically designed to highlight the failures of single-vector retrieval. Overall, our results show that neural retrievers contain more expressive and indexable structure than their default scoring functions expose, and that other scoring methods can nonetheless exploit this structure.
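To make the scoring step concrete, the following is a minimal sketch of BM25 over a latent vocabulary, not the authors' implementation. It assumes each document and query has already been reduced to a bag of latent IDs (for example, the top-k active Sparse Autoencoder latents per token, pooled over the text), and that the term frequency of a latent is simply how often it appears in that bag; the abstract does not specify either choice. The helper names `top_k_latents`, `build_index`, and `bm25_score` are hypothetical, and the Sparse Autoencoder itself is not shown.

```python
import math
from collections import Counter


def top_k_latents(activations, k):
    """Hypothetical sparsification step: indices of the k largest positive activations."""
    ranked = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return [i for i in ranked[:k] if activations[i] > 0]


def build_index(doc_latents):
    """Build collection statistics from bags of latent IDs (one list per document)."""
    tfs = [Counter(bag) for bag in doc_latents]      # latent -> term frequency, per document
    df = Counter()                                   # latent -> number of documents containing it
    for tf in tfs:
        df.update(tf.keys())
    avgdl = sum(len(bag) for bag in doc_latents) / len(doc_latents)
    return tfs, df, avgdl


def bm25_score(query_latents, tf, df, n_docs, doc_len, avgdl, k1=1.2, b=0.75):
    """Standard BM25 score of one document, with latent IDs playing the role of terms."""
    score = 0.0
    for t in set(query_latents):
        f = tf.get(t, 0)
        if f == 0:
            continue
        # Non-negative IDF variant (as used by Lucene-style BM25).
        idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))
    return score


# Toy collection: latent IDs are made up for illustration.
docs = [[3, 7, 7, 12], [3, 5], [7, 9, 9, 9]]
query = [7, 12]

tfs, df, avgdl = build_index(docs)
scores = [
    bm25_score(query, tfs[i], df, len(docs), len(docs[i]), avgdl)
    for i in range(len(docs))
]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Because the latent IDs behave like ordinary terms, the same bags could in principle be fed to any inverted-index engine that implements BM25; the pure-Python scorer above is only meant to show the arithmetic.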