Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, these high-dimensional embeddings inflate computational and storage costs, hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). The framework introduces the Sequential Matryoshka Representation Learning (SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed 256-dimensional LLM2Vec embeddings by 1.1 and 2.7 points over the Matryoshka-Adaptor and Search-Adaptor models, respectively.
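For readers unfamiliar with the Matryoshka setup, the sketch below illustrates the generic idea that SMEC builds on: embeddings are trained so that every dimension prefix (e.g., the first 256 of 1024 dimensions) remains a usable embedding on its own, so compression at deployment reduces to slicing. This is a minimal illustration of plain Matryoshka representation learning, not the authors' SMRL/ADS/S-XBM implementation; the function name, prefix sizes, and in-batch contrastive loss are illustrative assumptions.

```python
# Minimal sketch of generic Matryoshka-style representation learning.
# NOT the SMEC/SMRL implementation; dims and loss are illustrative.
import torch
import torch.nn.functional as F

def matryoshka_contrastive_loss(query_emb, doc_emb,
                                dims=(256, 512, 1024), temperature=0.05):
    """Average an in-batch contrastive loss over nested dimension prefixes.

    query_emb, doc_emb: (batch, full_dim) paired embeddings.
    dims: prefix sizes to supervise; each truncated prefix is trained
    to rank its paired document highest within the batch.
    """
    total = 0.0
    for d in dims:
        q = F.normalize(query_emb[:, :d], dim=-1)  # truncate, then re-normalize
        k = F.normalize(doc_emb[:, :d], dim=-1)
        logits = q @ k.T / temperature             # in-batch similarity matrix
        labels = torch.arange(q.size(0), device=q.device)
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)

# At inference, compression is just slicing the stored embedding:
full = torch.randn(8, 1024)
compressed = F.normalize(full[:, :256], dim=-1)    # 256-dim deployment embedding
```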