In this paper, we consider the problem of reducing the inference latency of language model-based dense retrieval systems by introducing structural compression and model size asymmetry between the context and query encoders. First, we investigate the impact of pre- and post-training compression on the MSMARCO, Natural Questions, TriviaQA, SQUAD, and SCIFACT datasets, finding that asymmetry between the dual encoders in dense retrieval can lead to improved inference efficiency. Building on this finding, we introduce Kullback-Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods by pruning and aligning the query encoder after training. Specifically, KALE extends traditional knowledge distillation and applies it after bi-encoder training, allowing for effective query encoder compression without full retraining or index generation. Using KALE and asymmetric training, we can generate models that exceed the performance of DistilBERT despite having 3x faster inference.
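To make the alignment idea concrete, the following is a minimal sketch of a KL-divergence loss between the query embedding of the original (teacher) encoder and that of a pruned (student) encoder. The abstract does not specify the exact objective, so several details here are assumptions made for illustration: the softmax normalization over embedding dimensions, the `temperature` parameter, the function names `softmax` and `kl_alignment_loss`, and the toy vectors.

```python
import math


def softmax(vec, temperature=1.0):
    """Turn an embedding vector into a probability distribution."""
    scaled = [v / temperature for v in vec]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_alignment_loss(teacher_emb, student_emb, temperature=1.0):
    """KL(teacher || student) over softmax-normalized embedding dimensions.

    Assumption: embeddings are treated as logits; the paper's exact
    formulation may differ.
    """
    p = softmax(teacher_emb, temperature)
    q = softmax(student_emb, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy query embeddings (hypothetical values, not taken from the paper).
teacher = [0.9, -0.3, 0.1, 0.4]   # uncompressed query encoder output
aligned = [0.8, -0.2, 0.1, 0.5]   # pruned encoder, close to the teacher
drifted = [-0.5, 0.9, 0.6, -0.7]  # pruned encoder, far from the teacher

print(kl_alignment_loss(teacher, teacher))
print(kl_alignment_loss(teacher, aligned))
print(kl_alignment_loss(teacher, drifted))
```

In a setup like this, only the query encoder is updated to minimize the loss. The context encoder and its precomputed index stay fixed, which is consistent with the claim that compression requires neither full retraining nor index generation.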