We study a mismatch among the flat architecture of deep learning recommendation models, the common distributed training paradigm, and the hierarchical topology of data centers. To address the associated inefficiencies, we propose Disaggregated Multi-Tower (DMT), a modeling technique that consists of (1) Semantic-preserving Tower Transform (SPTT), a novel training paradigm that decomposes the monolithic global embedding lookup process into disjoint towers to exploit data center locality; (2) Tower Module (TM), a synergistic dense component attached to each tower that reduces model complexity and communication volume through hierarchical feature interaction; and (3) Tower Partitioner (TP), a feature partitioner that uses learned embeddings to systematically create towers with meaningful feature interactions and load-balanced assignments, preserving model quality and training throughput. We show that DMT achieves up to a 1.9x speedup over state-of-the-art baselines without losing accuracy, across multiple generations of hardware at large data center scales.
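To make the idea of partitioning features into towers concrete, the following is a minimal sketch and not the TP algorithm itself, which the abstract does not specify. It assumes each feature comes with a small vector (standing in for a learned embedding), groups features greedily by average cosine similarity, and caps each tower at an equal share of the features as a crude stand-in for load balancing. The names `cosine` and `partition_features`, the input format, and the greedy heuristic are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def partition_features(embeddings, num_towers):
    """Greedy, capacity-capped grouping of features into towers.

    embeddings: dict of feature name -> list of floats (hypothetical
    per-feature vectors standing in for learned embeddings).
    Returns a list of num_towers lists of feature names.
    """
    names = sorted(embeddings)  # fixed order keeps the result deterministic
    capacity = math.ceil(len(names) / num_towers)  # cap on features per tower
    towers = [[] for _ in range(num_towers)]
    for name in names:
        best_idx, best_score = None, None
        for idx, tower in enumerate(towers):
            if len(tower) >= capacity:
                continue  # tower is full, skip it
            if tower:
                # Average similarity to the features already in this tower.
                score = sum(
                    cosine(embeddings[name], embeddings[m]) for m in tower
                ) / len(tower)
            else:
                score = 0.0  # an empty tower scores 0.0
            if best_score is None or score > best_score:
                best_idx, best_score = idx, score
        towers[best_idx].append(name)
    return towers


# Toy input: "a" and "b" point the same way, as do "c" and "d".
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
print(partition_features(toy, 2))  # [['a', 'b'], ['c', 'd']]
```

The cap only bounds tower size from above; a real partitioner would presumably balance actual lookup and communication cost rather than feature count.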