Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM) and has significantly improved statistical performance, especially in cold-start scenarios. However, existing systems are not tailored to meta learning based DLRM models and suffer from critical efficiency problems in distributed training on GPU clusters. This is because the conventional deep learning pipeline is not optimized for the two task-specific datasets and the two update loops of meta learning. This paper presents a high-performance framework for large-scale training of optimization-based Meta DLRM models over \textbf{G}PU clusters, namely \textbf{G}-Meta. First, G-Meta combines data parallelism and model parallelism, carefully orchestrating computation and communication to enable high-speed distributed training. Second, it introduces a Meta-IO pipeline for efficient data ingestion that alleviates the I/O bottleneck. Experimental results show that G-Meta achieves notable training speedups without loss of statistical performance. Since early 2022, G-Meta has been deployed in Alipay's core advertising and recommender system, shortening the continuous delivery cycle of models by a factor of four. Benefiting from larger-scale training samples and more tasks, it also achieves a 6.48\% improvement in Conversion Rate (CVR) and a 1.06\% increase in CPM (Cost Per Mille) in Alipay's homepage display advertising.
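For reference, the "two task-specific datasets" and "two update loops" correspond, in the widely used MAML formulation of optimization-based meta learning, to a per-task support set $\mathcal{D}^{s}_i$ and query set $\mathcal{D}^{q}_i$, and to an inner and an outer update. The sketch below assumes the Meta DLRM models follow this standard formulation (the abstract does not state the exact variant); the step sizes $\alpha$ and $\beta$ and the loss notation $\mathcal{L}$ are notation introduced here for illustration:

\begin{align*}
\theta_i' &= \theta - \alpha \, \nabla_{\theta} \mathcal{L}\big(\theta; \mathcal{D}^{s}_i\big) && \text{(inner loop: per-task adaptation on the support set)} \\
\theta &\leftarrow \theta - \beta \, \nabla_{\theta} \sum_{i} \mathcal{L}\big(\theta_i'; \mathcal{D}^{q}_i\big) && \text{(outer loop: meta-update on the query sets)}
\end{align*}

Because the outer gradient is taken through the adapted parameters $\theta_i'$, each training step must read two datasets per task and run two dependent rounds of computation. This differs from the single-dataset, single-update loop assumed by conventional training pipelines.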