Dense retrieval systems have proven effective across various benchmarks, but they require substantial memory to store large search indices. Recent advances in embedding compression show that index sizes can be greatly reduced with minimal loss in ranking quality. However, existing studies often overlook the role of corpus complexity -- a critical factor, as recent work shows that both corpus size and document length strongly affect dense retrieval performance. In this paper, we introduce CoRECT (Controlled Retrieval Evaluation of Compression Techniques), a framework for large-scale evaluation of embedding compression methods, supported by a newly curated dataset collection. To demonstrate its utility, we benchmark eight representative types of compression methods. Notably, we show that non-learned compression achieves substantial index size reduction, even on corpora of up to 100M passages, with statistically insignificant performance loss. However, selecting the optimal compression method remains challenging, as performance varies across models. Such variability highlights the need for CoRECT to enable consistent comparison and informed selection of compression methods. All code, data, and results are available on GitHub and HuggingFace.
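As a concrete illustration of what "non-learned compression" can mean in this setting, the sketch below applies per-dimension scalar quantization (float32 to uint8) to a toy embedding matrix, yielding a 4x index size reduction. This is a minimal, generic example for intuition only; the calibration scheme, array names, and parameters are assumptions, not the paper's actual method or implementation.

```python
import numpy as np

# Minimal sketch of non-learned embedding compression via per-dimension
# scalar quantization (float32 -> uint8). Illustrative only; not the
# paper's actual pipeline.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 128)).astype(np.float32)  # toy embedding matrix

# Per-dimension min/max calibration over the corpus.
lo, hi = emb.min(axis=0), emb.max(axis=0)
scale = (hi - lo) / 255.0

# Quantize: map each float32 value to a 1-byte code in [0, 255].
codes = np.round((emb - lo) / scale).astype(np.uint8)

# Dequantize at search time to score queries against reconstructed vectors.
recon = codes.astype(np.float32) * scale + lo

ratio = emb.nbytes / codes.nbytes        # 4-byte floats -> 1-byte codes = 4x
max_err = np.abs(emb - recon).max()      # worst-case per-value distortion
```

The compression ratio is exactly 4x here, and the maximum reconstruction error is bounded by half a quantization step per dimension, which is why such simple schemes can lose little ranking quality in practice.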