LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations

We present LEAF ("Lightweight Embedding Alignment Framework"), a knowledge distillation framework for text embedding models. Its key distinguishing feature is that the distilled leaf models are aligned to their teacher, so that embeddings produced by a leaf model can be compared directly with embeddings produced by the teacher. In information retrieval, this allows for flexible asymmetric architectures in which documents are encoded with the larger teacher model while queries are served by the smaller leaf model. We also show that leaf models automatically inherit MRL (Matryoshka Representation Learning) and robustness to output quantization whenever these properties are present in the teacher model, without being explicitly trained for them. To demonstrate the capability of our framework, we publish leaf-ir, a 23M-parameter, information-retrieval-oriented text embedding model trained using LEAF. It sets a new state of the art (SOTA) on BEIR, ranking #1 among models of its size on the public leaderboard for this benchmark. When run in asymmetric mode, its retrieval performance increases further. Our scheme is, however, not restricted to the information retrieval setting, and we demonstrate its wider applicability by synthesizing the multi-task leaf-mt model. This model also sets a new SOTA, ranking #1 among models of its size on the public MTEB v2 (English) leaderboard. LEAF is applicable to black-box models and, in contrast to other embedding model training frameworks, requires neither relevance judgments nor hard negatives, and training can be conducted with small batch sizes. Dataset and training infrastructure requirements for our framework are therefore modest. We make our models publicly available under the permissive Apache 2.0 license.
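
To make the asymmetric setup concrete, the following toy sketch distills a small "student" so that its outputs land in the teacher's embedding space, then retrieves with student-encoded queries against a teacher-encoded document index. Everything in it is an illustrative assumption rather than the method described above: the encoders are random numeric maps instead of text models, the names `teacher_encode`, `student_encode`, and `cosine_scores` are hypothetical, and the least-squares fit to teacher outputs merely stands in for whatever training objective LEAF actually uses, which this abstract does not specify. The sketch does mirror two stated properties: the teacher is only ever called as a black box, and no relevance judgments or hard negatives are involved.

```python
import numpy as np

rng = np.random.default_rng(0)
IN_DIM, HIDDEN_DIM, EMB_DIM = 32, 64, 16  # toy sizes, chosen only for illustration

# Hypothetical "teacher": a small nonlinear map standing in for a large encoder.
# The student only ever sees its outputs (black-box access).
W1 = 0.05 * rng.normal(size=(IN_DIM, HIDDEN_DIM))
W2 = rng.normal(size=(HIDDEN_DIM, EMB_DIM))

def teacher_encode(x):
    return np.tanh(x @ W1) @ W2

def cosine_scores(queries, docs):
    """Cosine similarity between every query row and every document row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return q @ d.T

# Distillation data: unlabeled inputs only, no judgments and no hard negatives.
x_train = rng.normal(size=(2000, IN_DIM))

# Assumed stand-in objective: least-squares regression onto teacher embeddings.
# The student is a single linear layer, i.e. far fewer parameters than the teacher.
W_student, *_ = np.linalg.lstsq(x_train, teacher_encode(x_train), rcond=None)

def student_encode(x):
    return x @ W_student

# Alignment check on held-out inputs: student and teacher embeddings of the
# same input should point in nearly the same direction.
x_test = rng.normal(size=(500, IN_DIM))
t, s = teacher_encode(x_test), student_encode(x_test)
mean_cos = float(np.mean(
    np.sum(t * s, axis=1) / (np.linalg.norm(t, axis=1) * np.linalg.norm(s, axis=1))
))

# Asymmetric retrieval: documents indexed once with the teacher,
# queries encoded at serving time with the cheaper student.
doc_feats = rng.normal(size=(50, IN_DIM))
query_feats = doc_feats[:20] + 0.05 * rng.normal(size=(20, IN_DIM))  # query i targets doc i
doc_index = teacher_encode(doc_feats)
top1 = cosine_scores(student_encode(query_feats), doc_index).argmax(axis=1)

print("mean student/teacher cosine on held-out inputs:", round(mean_cos, 4))
print("asymmetric top-1 hits:", int((top1 == np.arange(20)).sum()), "of 20")
```

Because the student is fit directly to the teacher's output space, the document index never needs to be re-encoded when the query encoder is swapped. In this toy, that is the whole point of alignment: a student trained with an ordinary contrastive objective would produce its own embedding space, and its query vectors would not be comparable with the teacher-encoded index.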
