Several production deep learning clusters have explored using inference hardware for DNN training during off-peak serving hours, when many inference GPUs sit idle. Training a DNN on a mix of heterogeneous training and inference GPUs, known as hybrid device training, is challenging because the devices differ in compute capability and differ significantly in memory capacity. We propose QSync, a training system that enables efficient synchronous data-parallel DNN training over hybrid devices by strategically exploiting quantized operators. Based on each device's available resource capacity, QSync selects a quantization-minimized setting for the operators in the distributed DNN training graph, minimizing model accuracy degradation while retaining the training efficiency that quantization brings. We design three components: a predictor with a bi-directional mixed-precision indicator that reflects the sensitivity of DNN layers to fixed-point and floating-point low-precision operators; a replayer with a neighborhood-aware cost mapper that accurately estimates the latency of distributed hybrid mixed-precision training; and an allocator that efficiently synchronizes workers with minimal model accuracy degradation. QSync bridges the PyTorch computational graph to an optimized backend, providing high-performance quantization kernels and flexible support for various GPU architectures. Extensive experiments show that QSync's predictor can accurately simulate distributed mixed-precision training with <5% error, and that QSync delivers a consistent 0.27-1.03% accuracy improvement on from-scratch training tasks compared to uniform precision.
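The idea of choosing per-operator precision according to device capacity can be illustrated with a toy sketch. This is not QSync's allocator: the greedy rule, the `Op` record, the latency figures, and the sensitivity scores are all hypothetical, chosen only to show how a slower device might lower the precision of its least sensitive operators until its iteration time matches a target set by the faster devices.

```python
from dataclasses import dataclass

# Precisions ordered from highest to lowest. All numbers below are made up.
PRECISIONS = ["fp32", "fp16", "int8"]


@dataclass
class Op:
    name: str
    latency_ms: dict      # hypothetical latency per precision on the slow device
    sensitivity: float    # hypothetical score: higher means quantizing hurts accuracy more
    precision: str = "fp32"


def total_latency(ops):
    """Sum the latency of every operator at its currently assigned precision."""
    return sum(op.latency_ms[op.precision] for op in ops)


def allocate(ops, target_ms):
    """Greedy toy rule: lower the precision of the least sensitive operators
    first, one level at a time, and stop as soon as the target is met."""
    for op in sorted(ops, key=lambda o: o.sensitivity):
        for prec in PRECISIONS[1:]:
            if total_latency(ops) <= target_ms:
                return ops
            op.precision = prec
    return ops  # the target may be unreachable even with every op at int8


ops = [
    Op("conv1", {"fp32": 4.0, "fp16": 2.5, "int8": 1.5}, sensitivity=0.9),
    Op("conv2", {"fp32": 6.0, "fp16": 3.5, "int8": 2.0}, sensitivity=0.2),
    Op("fc",    {"fp32": 3.0, "fp16": 2.0, "int8": 1.2}, sensitivity=0.5),
]

# Assume the faster training GPUs finish an iteration in 9.5 ms.
allocate(ops, target_ms=9.5)
print({op.name: op.precision for op in ops}, total_latency(ops))
```

In this example only the least sensitive operator is quantized, and the remaining operators stay at full precision once the slow device keeps pace. A real system would also need to account for memory limits, casting overhead between neighboring operators, and communication time, none of which this sketch models.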