AutoML for Large Capacity Modeling of Meta's Ranking Systems

Hang Yin, Kuang-Hung Liu, Mengying Sun, Yuxin Chen, Buyun Zhang, Jiang Liu, Vivek Sehgal, Rudresh Rajnikant Panchal, Eugen Hotaj, Xi Liu, Daifeng Guo, Jamey Zhang, Zhou Wang, Shali Jiang, Huayu Li, Zhengxing Chen, Wen-Yen Chen, Jiyan Yang, Wei Wen

From arXiv. Hang Yin and Kuang-Hung Liu contributed equally.

Web-scale ranking systems at Meta, which serve billions of users, are complex. Improving ranking models is essential but engineering-heavy. Automated Machine Learning (AutoML) can release engineers from the labor-intensive work of tuning ranking models; however, it is unknown whether AutoML is efficient enough to meet tight real-world production timelines and, at the same time, bring additional improvements over strong baselines. Moreover, to achieve higher ranking performance, there is an ever-increasing demand to scale ranking models up to even larger capacity, which imposes further challenges on efficiency. The large scale of the models and the tight production schedule require AutoML to outperform human baselines using only a small number of model evaluation trials (around 100). We present a sampling-based AutoML method, focusing on neural architecture search and hyperparameter optimization, that addresses these challenges in Meta-scale production when building large-capacity models. Our approach efficiently handles large-scale data demands. It leverages a lightweight predictor-based searcher and reinforcement learning to explore vast search spaces, significantly reducing the number of model evaluations. Through experiments in large-capacity modeling for click-through rate (CTR) and conversion rate (CVR) applications, we show that our method achieves outstanding Return on Investment (ROI) versus human-tuned baselines, with up to 0.09% Normalized Entropy (NE) loss reduction or a 25% Queries per Second (QPS) increase, by sampling only one hundred models on average from a curated search space. The proposed AutoML method has already made real-world impact: a discovered Instagram CTR model with up to -0.36% NE gain (over the existing production baseline) was selected for a large-scale online A/B test and showed a statistically significant gain. These production results proved AutoML's efficacy and accelerated its adoption in ranking systems at Meta.
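For readers unfamiliar with the NE metric quoted in the abstract, the following is a minimal sketch of how Normalized Entropy is commonly computed for CTR models: the average log loss divided by the entropy of the background click rate. The function names are hypothetical, and treating the reported percentages as relative changes in NE against a baseline is an assumption, since the abstract does not define them.

```python
import math


def normalized_entropy(labels, preds, eps=1e-12):
    """Average log loss divided by the entropy of the empirical positive rate.

    Lower is better; a model that always predicts the base rate scores 1.0.
    """
    n = len(labels)
    if n == 0 or n != len(preds):
        raise ValueError("labels and preds must be non-empty and equal length")

    # Average binary cross-entropy of the model's predictions.
    log_loss = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        log_loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    log_loss /= n

    # Entropy of the background rate (fraction of positive labels).
    base = sum(labels) / n
    if base <= 0.0 or base >= 1.0:
        raise ValueError("labels must contain both classes")
    base_entropy = -(base * math.log(base) + (1.0 - base) * math.log(1.0 - base))

    return log_loss / base_entropy


def relative_ne_change_pct(ne_new, ne_baseline):
    """Relative NE change in percent (assumed convention); negative is better."""
    return 100.0 * (ne_new - ne_baseline) / ne_baseline
```

Under this convention, a candidate whose NE is 0.9964 against a baseline of 1.0 would be reported as a -0.36% change, which is why a negative number is described as a gain.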

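To make the idea of a budgeted, predictor-guided sampling search more concrete, here is a toy sketch. It is not the authors' method: the search space, the `toy_evaluate` objective, and the additive predictor are all hypothetical stand-ins, and the reinforcement learning component mentioned in the abstract is omitted. It only illustrates how a cheap predictor can decide which sampled configurations deserve one of roughly 100 expensive evaluations.

```python
import math
import random

# Hypothetical toy search space; real ranking-model spaces are far larger.
SEARCH_SPACE = {
    "num_layers": [2, 3, 4, 6],
    "hidden_dim": [128, 256, 512, 1024],
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "dropout": [0.0, 0.1, 0.2],
}
KEYS = list(SEARCH_SPACE)


def sample_config(rng):
    return {k: rng.choice(SEARCH_SPACE[k]) for k in KEYS}


def config_key(config):
    return tuple(config[k] for k in KEYS)


def toy_evaluate(config):
    """Stand-in for an expensive training run; returns a loss (lower is better)."""
    return (0.70
            + 0.02 * abs(config["num_layers"] - 4)
            + 0.03 * abs(math.log2(config["hidden_dim"]) - 9)
            + 0.05 * abs(math.log10(config["learning_rate"]) + 3)
            + 0.10 * abs(config["dropout"] - 0.1))


class AdditivePredictor:
    """Cheap predictor: global mean loss plus a mean offset per observed choice."""

    def fit(self, history):
        losses = [loss for _, loss in history]
        self.mean = sum(losses) / len(losses)
        sums, counts = {}, {}
        for config, loss in history:
            for k in KEYS:
                key = (k, config[k])
                sums[key] = sums.get(key, 0.0) + (loss - self.mean)
                counts[key] = counts.get(key, 0) + 1
        self.offsets = {key: sums[key] / counts[key] for key in sums}

    def predict(self, config):
        return self.mean + sum(self.offsets.get((k, config[k]), 0.0) for k in KEYS)


def search(budget=100, warmup=20, batch=10, pool=200, seed=0):
    """Spend exactly `budget` evaluations; return (best (config, loss), history)."""
    rng = random.Random(seed)
    history, seen = [], set()

    def evaluate(config):
        history.append((config, toy_evaluate(config)))
        seen.add(config_key(config))

    # Warm-up: random samples give the predictor its first training data.
    for _ in range(min(warmup, budget)):
        evaluate(sample_config(rng))

    predictor = AdditivePredictor()
    while len(history) < budget:
        predictor.fit(history)
        # Sample a large pool cheaply and rank it by predicted loss.
        ranked = sorted((sample_config(rng) for _ in range(pool)),
                        key=predictor.predict)
        limit = min(batch, budget - len(history))
        picks, picked = [], set()
        for cand in ranked:
            key = config_key(cand)
            if key not in seen and key not in picked:
                picks.append(cand)
                picked.add(key)
            if len(picks) == limit:
                break
        if not picks:
            picks = [ranked[0]]  # re-evaluate a duplicate so the loop always progresses
        for cand in picks:
            evaluate(cand)

    return min(history, key=lambda item: item[1]), history
```

Calling `search()` returns the best configuration found together with the full evaluation history. The design point is that only `toy_evaluate` is expensive, so the predictor is refit after every batch and used to filter a much larger pool of cheap samples.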