Multimodal content is crucial for click-through rate (CTR) prediction. However, directly feeding continuous embeddings from pre-trained models into CTR models yields suboptimal results, due to misaligned optimization objectives and inconsistent convergence speeds during joint training. Discretizing the embeddings into semantic IDs before they enter the CTR model is a more effective solution, yet existing methods suffer from low codebook utilization, limited reconstruction accuracy, and weak semantic discriminability. We propose RQ-GMM (Residual Quantized Gaussian Mixture Model), which introduces probabilistic modeling to better capture the statistical structure of multimodal embedding spaces. By combining Gaussian Mixture Models with residual quantization, RQ-GMM achieves superior codebook utilization and reconstruction accuracy. Experiments on public datasets and online A/B tests on a large-scale short-video platform demonstrate substantial improvements: RQ-GMM yields a 1.502% gain in Advertiser Value over strong baselines. The method has been fully deployed and serves daily recommendations to hundreds of millions of users.
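The abstract does not say how the Gaussian Mixture Model enters each residual-quantization level. What follows is a minimal sketch under stated assumptions, not the authors' implementation. It assumes one diagonal-covariance GMM is fitted per level with plain EM. Each residual vector is hard-assigned to its most probable component, that component's mean acts as the codeword, and the mean is subtracted before the next level. The resulting tuple of component indices serves as the semantic ID. The function names (`fit_diag_gmm`, `rq_gmm_encode`, `decode`), the level count, and the component count are all hypothetical.

```python
import numpy as np


def _log_joint(x, weights, means, var):
    """log pi_k + log N(x | mu_k, diag(var_k)) for every (sample, component) pair."""
    diff = x[:, None, :] - means[None, :, :]              # (n, k, d)
    maha = (diff ** 2 / var[None, :, :]).sum(axis=2)      # (n, k) Mahalanobis terms
    log_det = np.log(var).sum(axis=1)                     # (k,) log-determinants
    d = x.shape[1]
    return np.log(weights)[None, :] - 0.5 * (maha + log_det[None, :] + d * np.log(2 * np.pi))


def fit_diag_gmm(x, k, n_iter=50, seed=0, eps=1e-6):
    """Fit a diagonal-covariance GMM to x of shape (n, d) with plain EM (assumed setup)."""
    rng = np.random.default_rng(seed)
    n, _ = x.shape
    means = x[rng.choice(n, size=k, replace=False)].copy()  # init means from random samples
    var = np.tile(x.var(axis=0) + eps, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: normalized responsibilities, computed stably in log space
        log_resp = _log_joint(x, weights, means, var)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means, and per-dimension variances
        nk = resp.sum(axis=0) + eps
        weights = nk / nk.sum()
        means = (resp.T @ x) / nk[:, None]
        var = np.maximum((resp.T @ (x ** 2)) / nk[:, None] - means ** 2, 0.0) + eps
    return weights, means, var


def rq_gmm_encode(x, levels=3, k=8, n_iter=50, seed=0):
    """Residual quantization with one GMM per level; returns codes, codebooks, final residual."""
    residual = x.astype(float).copy()
    codes, codebooks = [], []
    for level in range(levels):
        w, mu, var = fit_diag_gmm(residual, k, n_iter, seed + level)
        # Hard assignment: argmax of the log joint equals argmax of the posterior
        idx = _log_joint(residual, w, mu, var).argmax(axis=1)
        codes.append(idx)
        codebooks.append(mu)              # component means serve as codewords
        residual = residual - mu[idx]     # pass what is left to the next level
    return np.stack(codes, axis=1), codebooks, residual


def decode(codes, codebooks):
    """Reconstruct embeddings by summing the selected codeword from every level."""
    return sum(cb[codes[:, level]] for level, cb in enumerate(codebooks))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(500, 16))      # stand-in for pre-trained multimodal embeddings
    ids, books, _ = rq_gmm_encode(emb, levels=3, k=8)
    print(ids.shape)                      # (500, 3): one 3-level semantic ID per item
```

A GMM, unlike k-means, also learns per-component weights and variances. The sketch uses those variances only for assignment. A production system would more likely use them for utilization-aware or soft assignment, but the abstract leaves that open.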