We introduce MOON, a comprehensive set of sustainable, iterative practices for multimodal representation learning in e-commerce applications. MOON has been fully deployed across all stages of the Taobao search advertising system, including retrieval, relevance, and ranking. The gains are most pronounced on the click-through rate (CTR) prediction task, where MOON achieves an overall +20.00% improvement in online CTR. Over the past three years, the project has delivered the largest improvement on the CTR prediction task and has undergone five full-scale iterations. Throughout this exploration and iteration, we have accumulated insights and practical experience that we believe will benefit the research community. MOON follows a three-stage training paradigm of "Pretraining, Post-training, and Application", which allows multimodal representations to be integrated effectively with downstream tasks. Notably, to bridge the misalignment between the objectives of multimodal representation learning and those of downstream training, we define the exchange rate, which quantifies how effectively an improvement in an intermediate metric translates into a downstream gain. Through this analysis, we identify image-based search recall as a critical intermediate metric for guiding the optimization of multimodal models. Across its five iterations, MOON has evolved along four critical dimensions: data processing, training strategy, model architecture, and downstream application. We also share the lessons and insights gained from these iterative improvements. As part of our exploration of scaling effects in e-commerce, we further conduct a systematic study of the scaling laws governing multimodal representation learning, examining factors such as the number of training tokens, the number of negative samples, and the length of user behavior sequences.
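To make the notion of an exchange rate concrete, the following is a minimal sketch under stated assumptions. The abstract does not give a formula, so we assume here that the exchange rate can be approximated as the least-squares slope (through the origin) of downstream gains against intermediate-metric gains, measured over several model iterations. The function name `exchange_rate`, its parameters, and the numbers in the usage lines are hypothetical placeholders, not values or definitions from MOON.

```python
from typing import Sequence


def exchange_rate(intermediate_gains: Sequence[float],
                  downstream_gains: Sequence[float]) -> float:
    """Assumed definition: downstream gain per unit of intermediate-metric gain.

    Computed as the least-squares slope through the origin,
    sum(x * y) / sum(x * x), over paired gains from several iterations.
    This is an illustrative stand-in, not the paper's formula.
    """
    if len(intermediate_gains) != len(downstream_gains):
        raise ValueError("gain sequences must have the same length")
    denom = sum(x * x for x in intermediate_gains)
    if denom == 0:
        raise ValueError("intermediate gains are all zero; rate is undefined")
    numer = sum(x * y for x, y in zip(intermediate_gains, downstream_gains))
    return numer / denom


# Synthetic placeholder numbers, for illustration only (not results from MOON):
# gains in a candidate intermediate metric and in a downstream metric,
# one pair per model iteration.
recall_gains = [1.0, 2.0, 4.0]
downstream_metric_gains = [0.25, 0.5, 1.0]
rate = exchange_rate(recall_gains, downstream_metric_gains)
print(f"assumed exchange rate: {rate:.3f}")
```

Under this assumed definition, an intermediate metric with a higher and more stable rate across iterations would be the better optimization target, which is the role the abstract assigns to image-based search recall.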