Deriving predictable scaling laws that relate model performance to computational investment is crucial for designing massive-scale recommendation systems and allocating their resources. While such laws are well established for large language models, they remain difficult to establish for recommendation systems, especially those that process both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling; it stems from inefficient modules with low Model FLOPs Utilization (MFU) and from suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Its low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention. Its high-level innovations are Computation Skip (CompSkip) and Event-level Personalization. Together, these advances increase MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency relative to state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.
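The two quantities the abstract relies on, MFU and power-law scaling, can be illustrated with a minimal sketch. It assumes the common definition of MFU as achieved model FLOP/s divided by peak hardware FLOP/s, and a generic power law of the form loss ≈ a · compute^(−b) fitted in log-log space. The function names, the choice of "loss" as the performance metric, and all numbers below are hypothetical illustrations, not values or methods taken from the Kunlun work.

```python
import math


def mfu(model_flops_per_step, step_time_s, peak_flops_per_s):
    """Model FLOPs Utilization: achieved model FLOP/s over peak hardware FLOP/s.

    All three arguments are supplied by the caller; no hardware figures are assumed.
    """
    return model_flops_per_step / (step_time_s * peak_flops_per_s)


def fit_power_law(compute, loss):
    """Least-squares fit of loss ~= a * compute**(-b), done as a line in log-log space.

    Returns (a, b). This is a generic fit, not the paper's scaling-efficiency metric.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Hypothetical numbers: a step that performs 3.7e14 model FLOPs in 1 s
# on hardware with an assumed peak of 1e15 FLOP/s.
utilization = mfu(3.7e14, 1.0, 1e15)

# Synthetic points generated from a known power law, so the fit can be checked.
compute_budgets = [1e18, 1e19, 1e20, 1e21]
synthetic_loss = [2.0 * c ** -0.05 for c in compute_budgets]
a, b = fit_power_law(compute_budgets, synthetic_loss)
```

In this framing, a larger exponent `b` means each additional unit of compute buys more improvement, which is one informal way to read "scaling efficiency"; the paper's exact definition may differ.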