Towards Generalizable and Efficient Large-Scale Generative Recommenders

Source: arXiv. First published on the Netflix Tech Blog: https://netflixtechblog.medium.com/towards-generalizable-and-efficient-large-scale-generative-recommenders-a7db648aa257

Generative recommendation models can model user behavior as sequences of events and provide a shared backbone for multiple recommendation tasks. In production, however, pre-training gains do not automatically translate into downstream application improvements: task headroom, repeated-training cost, serving latency, and item freshness all affect transfer. We describe our experience scaling a generative recommender from 2M to 1B backbone parameters, excluding embedding and decoding layers, in a production-scale title recommendation setting. Across multiple downstream tasks, we observe task-dependent scaling behavior: some tasks approach an empirical ceiling within the observed scale range, while others continue to benefit from additional capacity. This motivates using offset scaling-law fits as a diagnostic for where additional model scale may be more or less useful. We then study production constraints that arise when applying the model in practice. Frequent retraining over trillions of behavior tokens makes training and decoding efficiency important; cached serving can make the immediate next-token target stale; and newly launched titles may need to be scored from semantic metadata before collaborative ID embeddings are reliable. We address these issues with multi-token prediction for serving-latency alignment, sampled softmax and a projected decoding head for efficient repeated training, and semantic item towers with collaborative-embedding masking for cold-start adaptation. In a one-week production-shadow evaluation over 1M users, the 1B-backbone model achieves higher MRR than the 2M-backbone baseline across all reported tasks. Overall, the results support treating model scale as one component of a production transfer problem, alongside task headroom, decoding cost, serving-latency alignment, and item generalization.
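To make the "offset scaling-law fit" diagnostic concrete, here is a minimal sketch in Python. It is not the authors' implementation, and the abstract does not give their functional form. The sketch assumes the common form error(N) = e_inf + a * N^(-b), where N is the backbone parameter count, error is a lower-is-better task metric (for example, 1 - MRR), and e_inf is the offset, i.e. the empirical floor the task approaches as scale grows. The function names `fit_offset_power_law` and `remaining_headroom` are hypothetical.

```python
import math


def fit_offset_power_law(sizes, errors, grid_steps=2000):
    """Fit errors ~ e_inf + a * sizes**(-b) (assumed form, see lead-in).

    For a fixed candidate offset e_inf, log(error - e_inf) is linear in
    log(size), so a and b follow from ordinary least squares in log-log
    space. The offset itself is chosen by a simple grid search.
    """
    xs = [math.log(n) for n in sizes]
    x_mean = sum(xs) / len(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    e_min = min(errors)
    best = None
    for step in range(grid_steps):
        # Candidate offsets cover [0, e_min), so every log argument is positive.
        e_inf = e_min * step / grid_steps
        ys = [math.log(e - e_inf) for e in errors]
        y_mean = sum(ys) / len(ys)
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        intercept = y_mean - slope * x_mean
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, e_inf, math.exp(intercept), -slope)
    _, e_inf, a, b = best
    return e_inf, a, b


def remaining_headroom(size, e_inf, a, b):
    """Predicted gap between the error at `size` and the fitted floor e_inf."""
    return a * size ** (-b)


if __name__ == "__main__":
    # Synthetic, illustrative numbers only; these are not results from the paper.
    sizes = [2e6, 1e7, 5e7, 2e8, 1e9]
    errors = [0.30 + 5.0 * n ** (-0.2) for n in sizes]
    e_inf, a, b = fit_offset_power_law(sizes, errors)
    print(e_inf, a, b, remaining_headroom(1e9, e_inf, a, b))
```

Read as a diagnostic, a task whose predicted headroom at the largest model is already small relative to its floor is near its empirical ceiling, while a task with large remaining headroom is a better candidate for additional scale.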

