Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and to master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves state-of-the-art (SOTA) performance over other methods across different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), a crucial indicator of recommendation experience. For instance, the model delivers a 7-day LT gain of +0.5% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.1% AUC gain.
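The abstract does not specify the form of the collaboration-aware distillation objective. As a minimal hedged sketch, one common way to distill knowledge from a teacher embedding (e.g., an ID-to-item or sequence-to-item embedding) into a student multimodal embedding is to maximize the cosine similarity between paired item representations. All names below are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def cosine_distill_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Illustrative distillation loss: 1 minus the mean cosine similarity
    between paired student (multimodal) and teacher (e.g., ID-to-item)
    embeddings. Shapes: (batch, dim) for both inputs."""
    # L2-normalize each row so the dot product equals cosine similarity
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    # Per-item cosine similarity, averaged over the batch
    return 1.0 - float(np.mean(np.sum(s * t, axis=1)))
```

The loss is 0 when student and teacher embeddings are perfectly aligned in direction and approaches 2 when they are opposed; in practice such a term would be combined with the contrastive retrieval objectives the abstract alludes to.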