Information retrieval has broadened from a standalone search service into a critical component of various advanced applications, where indexing efficiency, cost-effectiveness, and freshness are increasingly important yet remain underexplored. To address these demands, we introduce Semi-parametric Vocabulary Disentangled Retrieval (SVDR). SVDR is a novel semi-parametric retrieval framework that supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods; and a binary token index that allows for quick and cost-effective setup, resembling traditional term-based retrieval. In our evaluation on three open-domain question answering benchmarks with the entire Wikipedia as the retrieval corpus, SVDR consistently outperforms the corresponding baselines. It achieves a 3% higher top-1 retrieval accuracy than the dense retriever DPR when using an embedding-based index, and a 9% higher top-1 accuracy than BM25 when using a binary token index. Notably, adopting a binary token index reduces index preparation time from 30 GPU hours to just 2 CPU hours and storage size from 31 GB to 2 GB, a reduction of about 90% relative to an embedding-based index.
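To make the idea of a binary token index concrete, here is a minimal sketch. It is not the SVDR implementation: the whitespace tokenizer, the function names, and the hand-written `query_weights` are all assumptions made for illustration. The point is only that the passage side stores token presence (no embeddings and no GPU pass), while any weighting lives on the query side.

```python
from collections import defaultdict


def tokenize(text):
    """Toy tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_binary_token_index(passages):
    """Map each token to the set of passage ids that contain it.

    Only presence or absence is recorded, so the index is binary.
    """
    index = defaultdict(set)
    for pid, text in enumerate(passages):
        for tok in set(tokenize(text)):
            index[tok].add(pid)
    return index


def search(index, query_weights, k=1):
    """Score each passage by summing the weights of query tokens it contains."""
    scores = defaultdict(float)
    for tok, weight in query_weights.items():
        for pid in index.get(tok, ()):
            scores[pid] += weight
    # Highest score first; ties broken by passage id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


passages = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "The Seine flows through Paris",
]
index = build_binary_token_index(passages)

# Hypothetical query-side weights; a learned query encoder could supply these.
query_weights = {"capital": 1.0, "france": 2.0}
top = search(index, query_weights, k=2)
print(top)
```

Because building such an index needs only tokenization, it can run on CPU and be refreshed cheaply when the corpus changes, which is the kind of trade-off the abstract describes.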