Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with a task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or purely distillation-based training paradigms. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, match or exceed the state of the art for models of comparable size. The jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available; we hope they will encourage further advances in embedding model development.
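To make the training objective and the robustness claims concrete, here is a minimal sketch of how a distillation term and a task-specific contrastive term might be combined, and of truncation plus binary quantization at inference time. It assumes a PyTorch setup with a frozen teacher whose embeddings have been projected to the student's dimension; the weighting `lam`, the cosine-based distillation term, and all function names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, d, temperature=0.05):
    """InfoNCE over in-batch negatives: each query should score highest
    against its own document and lower against the other documents."""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = q @ d.T / temperature                  # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def distillation_loss(student_emb, teacher_emb):
    """Pull student embeddings toward a frozen teacher's embeddings.
    Cosine-based distillation is one common choice; the paper's exact
    formulation may differ."""
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()

def combined_loss(student_q, student_d, teacher_q, teacher_d, lam=0.5):
    """Hypothetical weighting of the two objectives; `lam` is illustrative."""
    l_con = contrastive_loss(student_q, student_d)
    l_dis = 0.5 * (distillation_loss(student_q, teacher_q)
                   + distillation_loss(student_d, teacher_d))
    return l_con + lam * l_dis

def truncate_and_binarize(emb, dims=256):
    """Matryoshka-style truncation to the first `dims` dimensions, then
    sign-based binary quantization (one bit per dimension); illustrative."""
    e = F.normalize(emb[..., :dims], dim=-1)
    return (e > 0).to(torch.uint8)
```

The intuition the abstract appeals to is that in-batch negatives keep the contrastive term cheap, while the distillation term transfers the teacher's embedding geometry to the smaller student, letting a compact model inherit capabilities it could not learn from the contrastive signal alone.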