Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are well established for large language models, they remain elusive for recommendation systems, especially those that process both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling, stemming from inefficient modules with low Model FLOPs Utilization (MFU) and from suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Our low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention; our high-level innovations include Computation Skip (CompSkip) and Event-level Personalization. Together, these advances raise MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency over state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.