Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of those models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting a new state of the art for models of this size. We also find that, although a small KD step alone is not enough to approach the results of full pre-training, adding a supervised step beforehand closes much of the gap while skipping the most costly unsupervised phase. Finally, we find that aligning the fine-tuning and pre-training setups is crucial when repurposing existing models. To enable further exploration of our results, we release various checkpoints as well as the code used to train them.
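To make the distinction concrete, a minimal sketch of how a multi-vector (ColBERT-style) model scores a query-document pair via late interaction: instead of comparing one pooled embedding per text, every query token embedding is matched against every document token embedding with a MaxSim-then-sum reduction. This is a generic illustration of the standard ColBERT scoring function, not the specific implementation released with this paper; the function name and toy dimensions are ours.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token vector,
    take the maximum similarity over all document token vectors,
    then sum those maxima over the query tokens."""
    # sims[i, j] = dot-product similarity between query token i
    # and document token j; shape (num_q_tokens, num_d_tokens)
    sims = query_vecs @ doc_vecs.T
    return float(sims.max(axis=1).sum())

# Toy example: 2 query tokens and 3 document tokens in 2-d space.
q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
d = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
print(maxsim_score(q, d))  # each query token finds an exact match -> 2.0
```

A single-vector model would instead pool each side into one embedding and compute a single dot product; the KD step described above distills such a single-vector teacher's relevance signal into this token-level scorer.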