Large recommendation models (LRMs) are fundamental to the multi-billion-dollar online advertising industry: they process massive datasets of hundreds of billions of examples before transitioning to continuous online training to adapt to rapidly changing user behavior. This scale of data directly impacts both computational costs and the speed at which new methods can be evaluated (R&D velocity). This paper presents actionable principles and high-level frameworks to guide practitioners in optimizing training data requirements. These strategies have been successfully deployed in Google's largest Ads CTR prediction models and are broadly applicable beyond LRMs. We outline the concept of data convergence, describe methods to accelerate this convergence, and finally detail how to optimally balance training data volume with model size.