Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and to master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves state-of-the-art (SOTA) performance over other methods across different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), a crucial indicator of recommendation experience. For instance, the model delivers a 7-day LT gain of +0.5% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.1% AUC gain.
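The abstract does not specify the form of the collaboration-aware distillation objective. As a minimal hedged sketch, one common way to distill knowledge from a teacher embedding (e.g., an ID-to-item or sequence-to-item embedding) into a student multimodal embedding is to maximize the cosine similarity between paired item representations. All names below are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def cosine_distill_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Illustrative distillation loss: 1 minus the mean cosine similarity
    between paired student (multimodal) and teacher (e.g., ID-to-item)
    embeddings. Shapes: (batch, dim) for both inputs."""
    # L2-normalize each row so the dot product equals cosine similarity
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    # Per-item cosine similarity, averaged over the batch
    return 1.0 - float(np.mean(np.sum(s * t, axis=1)))
```

The loss is 0 when student and teacher embeddings are perfectly aligned in direction and approaches 2 when they are opposed; in practice such a term would be combined with the contrastive retrieval objectives the abstract alludes to.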