Data-parallel distributed training of deep neural networks (DNNs) has been widely adopted, but can still suffer from communication bottlenecks. To address this issue, entire families of compression mechanisms have been developed, including quantization, sparsification, and low-rank approximation, some of which are seeing significant practical adoption. Despite this progress, almost all known compression schemes apply compression uniformly across DNN layers, even though layers are heterogeneous in their parameter counts and in their impact on model accuracy. In this work, we provide a general framework for dynamically adapting the degree of compression across the model's layers during training, which improves overall compression and yields substantial speedups without sacrificing accuracy. Our framework, called L-GreCo, is based on an adaptive algorithm that automatically picks the optimal compression parameters for each layer, guaranteeing the best compression ratio while satisfying an error constraint. Extensive experiments on image classification and language modeling tasks show that L-GreCo is effective across all existing families of compression methods, achieving up to 2.5$\times$ training speedup and up to 5$\times$ compression improvement over efficient implementations of existing approaches, while recovering full accuracy. Moreover, L-GreCo is complementary to existing adaptive algorithms, improving their compression ratio by 50% and practical throughput by 66%.
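The abstract does not spell out how the per-layer parameters are chosen, so the following is only a minimal sketch of one way to pose the problem it describes: pick one compression setting per layer so that the total compressed size is minimal while the summed error stays under a budget. It assumes that per-layer errors add up, that each layer has a small list of candidate settings, and that the error budget can be discretized into bins. The function name `pick_layer_options`, the `Option` tuple, and the sizes and errors in the example are all hypothetical and are not taken from L-GreCo.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical candidate setting for one layer: (label, compressed_size, error).
Option = Tuple[str, int, float]


def pick_layer_options(
    layers: List[List[Option]], max_error: float, bins: int = 1000
) -> Optional[Tuple[int, List[str]]]:
    """Pick one option per layer, minimizing total size with sum(error) <= max_error.

    Knapsack-style dynamic program over a discretized error budget.
    Returns (total_size, labels) or None if no combination fits the budget.
    """
    if max_error <= 0 or bins <= 0:
        raise ValueError("max_error and bins must be positive")
    step = max_error / bins
    # dp[b] = smallest total size reachable using exactly b error bins.
    dp = [math.inf] * (bins + 1)
    dp[0] = 0
    back = []  # back[i][b] = (previous bin, option index) chosen for layer i
    for options in layers:
        new_dp = [math.inf] * (bins + 1)
        choice = [None] * (bins + 1)
        for b, size_so_far in enumerate(dp):
            if size_so_far == math.inf:
                continue
            for k, (_, size, err) in enumerate(options):
                # Round errors up so the real budget is never exceeded.
                nb = b + math.ceil(err / step)
                if nb <= bins and size_so_far + size < new_dp[nb]:
                    new_dp[nb] = size_so_far + size
                    choice[nb] = (b, k)
        dp = new_dp
        back.append(choice)
    best_bin = min(range(bins + 1), key=lambda b: dp[b])
    if dp[best_bin] == math.inf:
        return None
    # Walk the back-pointers from the last layer to the first.
    labels = []
    b = best_bin
    for options, choice in zip(reversed(layers), reversed(back)):
        b, k = choice[b]
        labels.append(options[k][0])
    labels.reverse()
    return dp[best_bin], labels


# Made-up numbers: a small layer that is sensitive to compression and a
# large layer that tolerates it well.
layers = [
    [("8bit", 80, 0.0), ("4bit", 40, 1.0), ("2bit", 20, 3.0)],
    [("8bit", 800, 0.0), ("4bit", 400, 1.0), ("2bit", 200, 2.0)],
]
result = pick_layer_options(layers, max_error=3.0, bins=3)
print(result)  # (240, ['4bit', '2bit'])
```

In this toy instance, uniform 2-bit compression would exceed the error budget and uniform 4-bit compression would cost 440 units, whereas the non-uniform choice reaches 240 units within the same budget. Rounding errors up to whole bins keeps the constraint safe but can reject borderline combinations when `bins` is coarse relative to the error values.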