Generative recommendation models can represent user behavior as sequences of events and provide a shared backbone for multiple recommendation tasks. In production, however, pre-training gains do not automatically translate into downstream application improvements: task headroom, repeated-training cost, serving latency, and item freshness all affect transfer. We describe our experience scaling a generative recommender from 2M to 1B backbone parameters, excluding embedding and decoding layers, in a production-scale title recommendation setting. Across multiple downstream tasks, we observe task-dependent scaling behavior: some tasks approach an empirical ceiling within the observed scale range, while others continue to benefit from additional capacity. This motivates using offset scaling-law fits as a diagnostic for where additional model scale may be more or less useful. We then study production constraints that arise when applying the model in practice. Frequent retraining over trillions of behavior tokens makes training and decoding efficiency important; cached serving can make the immediate next-token target stale; and newly launched titles may need to be scored from semantic metadata before collaborative ID embeddings are reliable. We address these issues with multi-token prediction for serving-latency alignment, sampled softmax and a projected decoding head for efficient repeated training, and semantic item towers with collaborative-embedding masking for cold-start adaptation. In a one-week production-shadow evaluation over 1M users, the 1B-backbone model achieves higher MRR than the 2M-backbone baseline across all reported tasks. Overall, the results support treating model scale as one component of a production transfer problem, alongside task headroom, decoding cost, serving-latency alignment, and item generalization.
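To make the offset scaling-law diagnostic concrete, the following is a minimal sketch rather than the fitting procedure used in this work. It assumes, purely for illustration, that a task's error metric follows `error(N) = floor + a * N**(-alpha)` in the backbone parameter count `N`; the abstract does not state the functional form. The function name `fit_offset_power_law`, the grid-scan fitting strategy, and all data points are hypothetical, and the data is synthetic, generated from known parameters rather than measured.

```python
import math


def fit_offset_power_law(sizes, errors, grid_steps=2000):
    """Fit errors ~ floor + a * sizes**(-alpha) by scanning candidate floors.

    For a fixed floor, log(error - floor) is linear in log(size), so the
    remaining two parameters come from ordinary least squares in log space.
    """
    xs = [math.log(n) for n in sizes]
    x_mean = sum(xs) / len(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    max_floor = min(errors)
    best = None
    for step in range(grid_steps):
        # Candidate floors stay strictly below the smallest observed error,
        # so the logarithm below is always defined.
        floor = max_floor * step / grid_steps
        ys = [math.log(e - floor) for e in errors]
        y_mean = sum(ys) / len(ys)
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        intercept = y_mean - slope * x_mean
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, floor, math.exp(intercept), -slope)
    _, floor, a, alpha = best
    return floor, a, alpha


# Synthetic example (NOT results from this work): five backbone sizes between
# 2M and 1B parameters, with errors generated from a known offset power law.
TRUE_FLOOR, TRUE_A, TRUE_ALPHA = 0.30, 20.0, 0.3
sizes = [2e6, 1e7, 5e7, 2e8, 1e9]
errors = [TRUE_FLOOR + TRUE_A * n ** (-TRUE_ALPHA) for n in sizes]

floor, a, alpha = fit_offset_power_law(sizes, errors)

# Headroom: how far the largest model still sits above the fitted floor.
headroom = errors[-1] - floor
print(f"floor={floor:.3f} a={a:.2f} alpha={alpha:.3f} headroom={headroom:.3f}")
```

Under this assumed form, a small fitted headroom suggests a task is close to its empirical ceiling within the observed scale range, while a large headroom suggests additional capacity may still help. With only a handful of model sizes the fit is weakly constrained, so it is better read as a diagnostic than as a forecast.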