Effective approaches that can scale embedding model depth (i.e., layers) and embedding size allow for the creation of models that are highly scalable across different computational resources and task requirements. While the recently proposed 2D Matryoshka training approach can efficiently produce a single embedding model whose sub-layers and sub-dimensions can measure text similarity, its effectiveness is significantly worse than if smaller models were trained separately. To address this issue, we propose Starbucks, a new training strategy for Matryoshka-like embedding models, which encompasses both the fine-tuning and pre-training phases. For the fine-tuning phase, we discover that, rather than sampling a random sub-layer and sub-dimension at each training step, providing a fixed list of layer-dimension pairs, from small to large sizes, and computing the loss across all pairs significantly improves the effectiveness of 2D Matryoshka embedding models, bringing them on par with their separately trained counterparts. To further enhance performance, we introduce a new pre-training strategy, which applies masked autoencoder language modelling to sub-layers and sub-dimensions during pre-training, resulting in a stronger backbone for subsequent fine-tuning of the embedding model. Experimental results on both semantic text similarity and retrieval benchmarks demonstrate that the proposed pre-training and fine-tuning strategies significantly improve effectiveness over 2D Matryoshka models, enabling Starbucks models to perform more efficiently and effectively than separately trained models.
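The core fine-tuning idea above can be sketched in code: instead of sampling one random (sub-layer, sub-dimension) pair per step as in 2D Matryoshka, the loss is computed over a fixed small-to-large list of layer-dimension pairs and aggregated. The sketch below is a minimal illustration under stated assumptions: the `SIZES` schedule, the `info_nce` contrastive loss, and the per-layer pooled-embedding inputs are hypothetical choices for illustration, not the exact Starbucks configuration.

```python
import numpy as np

# Hypothetical (sub-layer, sub-dimension) pairs, ordered small to large;
# the actual Starbucks schedule may differ.
SIZES = [(2, 32), (4, 64), (6, 128), (8, 256), (10, 512), (12, 768)]

def info_nce(a, b, temperature=0.05):
    """In-batch-negative contrastive loss between two [batch, dim] matrices.

    Row i of `a` and row i of `b` are a positive pair; all other rows in the
    batch serve as negatives. A standard choice for embedding fine-tuning.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # diagonal entries are the positives

def starbucks_loss(layers_a, layers_b):
    """Aggregate the loss over every fixed (layer, dim) pair.

    layers_a, layers_b: lists of pooled [batch, hidden] embeddings, one per
    sub-layer. For each pair, the embedding of the given sub-layer is
    truncated to the first `dim` dimensions before computing the loss.
    """
    losses = [info_nce(layers_a[layer - 1][:, :dim],
                       layers_b[layer - 1][:, :dim])
              for layer, dim in SIZES]
    return float(np.mean(losses))
```

In this formulation every sub-model in the fixed list receives a gradient signal at every step, which is what brings the small sub-models on par with separately trained counterparts.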