Lexical and semantic matching are two different, individually successful approaches to text retrieval, and fusing their results has proven more effective and robust than using either alone. Prior work performs hybrid retrieval by running lexical and semantic matching in separate systems (e.g., Lucene and Faiss, respectively) and then fusing their outputs. In contrast, our work integrates lexical representations with dense semantic representations by densifying high-dimensional lexical representations into what we call low-dimensional dense lexical representations (DLRs). Our experiments show that DLRs can effectively approximate the original lexical representations, preserving effectiveness while improving query latency. Furthermore, we can combine dense lexical and semantic representations to generate dense hybrid representations (DHRs) that are more flexible and yield faster retrieval than existing hybrid techniques. In addition, we explore jointly training lexical and semantic representations in a single model and empirically show that the resulting DHRs combine the advantages of the individual components. Our best DHR model is competitive with state-of-the-art single-vector and multi-vector dense retrievers in both in-domain and zero-shot evaluation settings. Moreover, our model is faster and requires smaller indexes, making our dense representation framework an attractive approach to text retrieval. Our code is available at https://github.com/castorini/dhr.
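To make the idea of densification concrete, the following is a minimal sketch and not the released implementation. The abstract does not spell out the procedure, so the sketch assumes a slice-and-max-pool scheme: the vocabulary-sized lexical vector is cut into equal slices, each slice keeps only its largest weight and the position of that weight, and two vectors are scored with a gated inner product that counts a slice only when both sides kept the same position. The names `densify`, `gated_inner_product`, `hybrid_score`, and the `weight` parameter are hypothetical and chosen for illustration.

```python
from typing import List, Sequence, Tuple

# A dense lexical representation: per-slice max values and their positions.
DLR = Tuple[List[float], List[int]]


def densify(lexical_vec: Sequence[float], num_slices: int) -> DLR:
    """Max-pool a high-dimensional lexical vector into num_slices entries.

    Assumed scheme (illustrative): equal-width contiguous slices.
    """
    if num_slices <= 0 or len(lexical_vec) % num_slices != 0:
        raise ValueError("vector length must be a positive multiple of num_slices")
    width = len(lexical_vec) // num_slices
    values: List[float] = []
    indices: List[int] = []
    for s in range(num_slices):
        chunk = lexical_vec[s * width:(s + 1) * width]
        # Position of the largest weight in this slice (first one on ties).
        best = max(range(width), key=lambda i: chunk[i])
        values.append(float(chunk[best]))
        indices.append(best)
    return values, indices


def gated_inner_product(query: DLR, passage: DLR) -> float:
    """Inner product over slices, counting a slice only if positions match."""
    q_val, q_idx = query
    p_val, p_idx = passage
    return sum(
        qv * pv
        for qv, pv, qi, pi in zip(q_val, p_val, q_idx, p_idx)
        if qi == pi
    )


def hybrid_score(q_dlr: DLR, p_dlr: DLR,
                 q_sem: Sequence[float], p_sem: Sequence[float],
                 weight: float = 1.0) -> float:
    """Lexical (gated) score plus a weighted semantic dot product."""
    semantic = sum(a * b for a, b in zip(q_sem, p_sem))
    return gated_inner_product(q_dlr, p_dlr) + weight * semantic


# Toy 8-term vocabulary densified into 2 slices of width 4.
query_lex = [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0]
passage_lex = [0.0, 1.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0]
q_dlr = densify(query_lex, num_slices=2)
p_dlr = densify(passage_lex, num_slices=2)
lexical = gated_inner_product(q_dlr, p_dlr)
combined = hybrid_score(q_dlr, p_dlr, [1.0, 0.0], [0.5, 2.0])
```

In this toy case the gated score equals the exact inner product of the original sparse vectors, because the only shared term is also the largest weight in its slice on both sides. In general the pooled score is an approximation, and its quality depends on how many slices are kept.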