Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control

Transformer-based models have recently grown deeper and larger. For better scalability, the prevailing training solution in industry is to split billions of parameters (tensors) into many tasks and run them across homogeneous accelerators (e.g., GPUs). However, such dedicated compute clusters are prohibitively expensive for academia and moderately sized companies. A more economical alternative is to aggregate existing heterogeneous devices and share resources among multiple tenants. Nevertheless, static hardware configurations and dynamic resource contention inevitably cause straggling tasks, which severely slow down overall training. Existing works mainly target traditional data parallelism. They do not work well for the newer tensor parallelism because of its strict communication and correctness constraints. In this paper we first present ZERO-resizing, a novel dynamic workload balancing technique that requires no data migration. We tune workloads in real time by temporarily resizing the matrices involved in core tensor-related computations. In particular, we design a data imputation policy to satisfy the consistency constraint required by normal training and a priority selection policy to reduce the accuracy loss. We also give a lightweight data migration technique that incurs no loss of accuracy, to cope with heavy heterogeneity. Our final SEMI-migration solution is built on top of these two techniques and adaptively decides which balancing task each of them should handle, to achieve both efficiency and accuracy overall. Extensive experiments on the representative Colossal-AI platform validate the effectiveness of our proposals.
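
To make the resizing idea concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes a column-parallel linear layer in which a straggling worker computes only part of its weight shard. The abstract does not specify the imputation or priority policies, so the sketch substitutes two simple stand-ins: columns are ranked by L1 norm (a hypothetical priority rule) and skipped output columns are filled with zeros (a hypothetical imputation rule). All function names are illustrative.

```python
# Illustrative sketch only: both policies below are assumptions, not the
# policies described in the paper.

def matmul(x, w):
    """Multiply x (m x k) by w (k x n), both given as lists of rows."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def select_columns(w, keep):
    """Hypothetical priority policy: keep the `keep` columns with the largest L1 norm."""
    n = len(w[0])
    norms = [sum(abs(row[j]) for row in w) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: norms[j], reverse=True)
    return sorted(ranked[:keep])


def resized_forward(x, w_shard, keep):
    """Compute x @ w_shard on a subset of columns, then zero-fill the skipped ones."""
    kept = select_columns(w_shard, keep)
    w_small = [[row[j] for j in kept] for row in w_shard]
    y_small = matmul(x, w_small)  # the cheaper, temporarily resized matmul
    n = len(w_shard[0])
    y = [[0.0] * n for _ in x]  # imputed output keeps the original shape
    for i, row in enumerate(y_small):
        for pos, j in enumerate(kept):
            y[i][j] = row[pos]
    return y


x = [[1.0, 2.0], [3.0, 4.0]]
w_shard = [[0.5, 0.01, -1.0], [2.0, 0.02, 0.0]]
full = matmul(x, w_shard)                      # unmodified workload
approx = resized_forward(x, w_shard, keep=2)   # straggler drops one column
```

Because the imputed output has the same shape as the full result, the subsequent collective communication sees tensors of the expected size, which is the consistency requirement the abstract refers to. The skipped column here has small weights, so the approximation error stays small; a real policy would need to bound this error during training.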
