Recent advances in large language models have proven remarkably effective for information retrieval (IR) tasks. While many neural IR systems encode queries and documents into single-vector representations, multi-vector models improve retrieval quality by producing multi-vector representations and enabling similarity search at the granularity of individual tokens. However, these models increase the memory and storage requirements of retrieval indices by an order of magnitude, and this growth in index size makes multi-vector IR models increasingly difficult to scale. We introduce Embedding from Storage Pipelined Network (ESPN), which offloads the entire re-ranking embedding tables to SSDs and reduces memory requirements by 5-16x. We design a software prefetcher with hit rates exceeding 90% that improves SSD-based retrieval by up to 6.4x, and we demonstrate that ESPN maintains near-memory query latency even for large query batch sizes.
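To make the contrast between single-vector and token-level scoring concrete, the following is a minimal sketch, not the ESPN implementation. It assumes a ColBERT-style "MaxSim" late-interaction score as one common form of token-granularity similarity; the abstract does not name a specific scoring function. All function names, vector dimensions, token counts, and corpus sizes are hypothetical and chosen only to illustrate how an index can grow by roughly an order of magnitude.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def single_vector_score(query: Vector, doc: Vector) -> float:
    """Single-vector retrieval: one embedding per query and per document."""
    return dot(query, doc)


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Assumed late-interaction score: for each query token, take its best
    match among the document's token vectors, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def index_bytes(num_docs: int, vectors_per_doc: int, dim: int, bytes_per_value: int) -> int:
    """Raw embedding-table size, ignoring any compression or metadata."""
    return num_docs * vectors_per_doc * dim * bytes_per_value


# Toy 2-dimensional embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # one token vector matches each query token
doc_b = [[0.5, 0.5]]               # a single, less specific token vector

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)

# Illustrative index sizes: 1M documents, 4-byte floats (all numbers hypothetical).
single_size = index_bytes(1_000_000, vectors_per_doc=1, dim=768, bytes_per_value=4)
multi_size = index_bytes(1_000_000, vectors_per_doc=60, dim=128, bytes_per_value=4)
growth = multi_size / single_size

print(score_a, score_b, growth)
```

Under these assumed numbers the multi-vector table is about ten times larger than the single-vector one, which is the kind of growth that motivates keeping the re-ranking embeddings on SSD rather than in memory.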