Text retrieval using learned sparse representations of queries and documents has, over the years, evolved into a highly effective approach to search. Retrieval with sparse representations became practical thanks to recent advances in approximate nearest neighbor search, in particular the emergence of highly efficient algorithms such as the inverted index-based Seismic and the graph-based Hnsw. In this work, we scrutinize the efficiency of sparse retrieval algorithms, focusing on the size of a data structure that is common to all algorithmic flavors and constitutes a substantial fraction of the overall index size: the forward index. Specifically, we seek compression techniques that reduce the storage footprint of the forward index without compromising search quality or inner product computation latency. Examining various integer compression techniques, we find that StreamVByte achieves the best trade-off among memory footprint, retrieval accuracy, and latency. We then improve on StreamVByte by introducing DotVByte, a new algorithm tailored to inner product computation. Experiments on MsMarco show that our improvements yield significant space savings while maintaining retrieval efficiency.
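The core operation the abstract refers to, computing an inner product between a sparse query and a document whose components are stored in a byte-compressed forward index, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's DotVByte: it uses a simple LEB128-style variable-byte codec as a stand-in for StreamVByte (which instead separates 2-bit length codes into a control stream for SIMD decoding), and all function names are hypothetical.

```python
def encode_varint(values):
    """Variable-byte encode a list of non-negative ints.

    Stand-in for StreamVByte: each int is split into 7-bit groups,
    with the high bit of a byte marking "more bytes follow".
    """
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)
            v >>= 7
        out.append(v)
    return bytes(out)


def decode_varint(data):
    """Inverse of encode_varint: reassemble 7-bit groups into ints."""
    values, v, shift = [], 0, 0
    for b in data:
        v |= (b & 0x7F) << shift
        if b & 0x80:          # continuation bit set: more bytes follow
            shift += 7
        else:                 # last byte of this integer
            values.append(v)
            v, shift = 0, 0
    return values


def sparse_dot(query, doc_term_ids, doc_weights):
    """Inner product of a sparse query (dict: term id -> weight)
    against a document stored as parallel lists of term ids and
    quantized integer weights, as in a forward index entry."""
    return sum(query.get(t, 0.0) * w for t, w in zip(doc_term_ids, doc_weights))


# A forward-index entry: term ids and quantized weights, both compressed.
doc_ids_blob = encode_varint([1, 3, 300])
doc_wts_blob = encode_varint([4, 2, 3])

# At query time: decode the entry, then accumulate the inner product.
score = sparse_dot({1: 2.0, 300: 1.0},
                   decode_varint(doc_ids_blob),
                   decode_varint(doc_wts_blob))
```

The point of techniques like DotVByte is to fuse the decode and accumulate steps shown here, so the compressed bytes never need to be fully materialized as integer arrays before scoring.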