Multi-vector retrieval methods, exemplified by the ColBERT architecture, have shown substantial promise for retrieval by offering a strong trade-off between retrieval latency and effectiveness. However, they incur a high storage cost, since a (potentially compressed) vector must be stored for every token in the input collection. To overcome this issue, we propose encoding each document into a fixed number of vectors, which are no longer necessarily tied to the input tokens. Beyond reducing storage costs, our approach has the advantage that every document representation occupies a fixed size on disk, which allows for better OS paging management. Through experiments on the MSMARCO passage corpus and BEIR with ColBERT-v2, a representative multi-vector ranking architecture, we find that passages can be effectively encoded into a fixed number of vectors while retaining most of the original effectiveness.
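To make the storage argument concrete, the following is a minimal sketch of what a fixed-size on-disk layout could look like. It is not the authors' implementation: the constants `NUM_VECTORS` and `DIM`, the uncompressed float32 encoding, and the helper names `write_record`, `read_record`, `per_token_bytes`, and `fixed_bytes` are all assumptions chosen for illustration. The point it shows is that when every document has the same number of vectors, a record's byte offset follows directly from its document id, and total storage no longer depends on document length.

```python
import io
import struct

# Hypothetical settings for illustration only (not taken from the paper).
NUM_VECTORS = 4   # assumed fixed number of vectors per document
DIM = 3           # assumed embedding dimension, kept tiny for readability
BYTES_PER_VALUE = 4  # uncompressed float32

# One record = NUM_VECTORS * DIM little-endian float32 values, no padding.
_RECORD = struct.Struct(f"<{NUM_VECTORS * DIM}f")
RECORD_BYTES = _RECORD.size


def write_record(buf, vectors):
    """Append one document's vectors to the end of buf as a fixed-size record."""
    if len(vectors) != NUM_VECTORS or any(len(v) != DIM for v in vectors):
        raise ValueError("expected exactly NUM_VECTORS vectors of length DIM")
    flat = [x for v in vectors for x in v]
    buf.seek(0, io.SEEK_END)
    buf.write(_RECORD.pack(*flat))


def read_record(buf, doc_id):
    """Read a document's vectors by seeking straight to doc_id * RECORD_BYTES."""
    buf.seek(doc_id * RECORD_BYTES)
    flat = _RECORD.unpack(buf.read(RECORD_BYTES))
    return [list(flat[i * DIM:(i + 1) * DIM]) for i in range(NUM_VECTORS)]


def per_token_bytes(token_counts, dim, bytes_per_value=BYTES_PER_VALUE):
    """Storage when one vector is kept per token: grows with document length."""
    return sum(token_counts) * dim * bytes_per_value


def fixed_bytes(num_docs, num_vectors, dim, bytes_per_value=BYTES_PER_VALUE):
    """Storage when every document gets the same number of vectors."""
    return num_docs * num_vectors * dim * bytes_per_value


if __name__ == "__main__":
    buf = io.BytesIO()
    docs = [[[float(d), 0.5, -1.0]] * NUM_VECTORS for d in range(3)]
    for vecs in docs:
        write_record(buf, vecs)
    # Random access needs no per-document offset table.
    print(read_record(buf, 2) == docs[2])
```

With per-token storage, variable-length records would typically require an auxiliary offset index to locate a document; in this sketch the multiplication `doc_id * RECORD_BYTES` takes its place. A real system would likely store compressed vectors and memory-map the file rather than use an in-memory buffer.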