Learned sparse text embeddings have gained popularity due to their effectiveness in top-k retrieval and their inherent interpretability. Their distributional idiosyncrasies, however, have long hindered their adoption in real-world retrieval systems. That changed with the recent development of approximate algorithms that exploit the distributional properties of sparse embeddings to speed up retrieval. Nonetheless, much of the existing literature evaluates these methods only on datasets with a few million documents, such as MSMARCO. It remains unclear how such systems behave on much larger collections and what challenges emerge at that scale. To bridge this gap, we investigate the behavior of state-of-the-art retrieval algorithms on massive datasets. We compare and contrast the recently proposed Seismic algorithm with graph-based solutions adapted from dense retrieval. We extensively evaluate SPLADE embeddings of the 138M passages in MSMARCO-v2, reporting indexing time along with other efficiency and effectiveness metrics.