Recent advances in Generative Artificial Intelligence have fueled numerous applications, particularly those involving Generative Adversarial Networks (GANs), which are essential for synthesizing realistic photos and videos. However, efficiently training GANs remains a critical challenge due to their computationally intensive and numerically unstable nature. Existing methods often require days or even weeks of training, imposing significant resource and time costs. In this work, we introduce ParaGAN, a scalable distributed GAN training framework that leverages asynchronous training and an asymmetric optimization policy to accelerate GAN training. ParaGAN employs a congestion-aware data pipeline and hardware-aware layout transformation to improve accelerator utilization, yielding over a 30% improvement in throughput. With ParaGAN, we reduce the training time of BigGAN from 15 days to 14 hours while achieving 91% scaling efficiency. Additionally, ParaGAN enables generating images at unprecedented resolutions with BigGAN.
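The abstract names an asymmetric optimization policy without detailing it. As a minimal point of reference only, the sketch below shows one common form of asymmetry in GAN training: distinct learning rates for the two networks (TTUR-style) plus extra discriminator steps per generator step, demonstrated on a toy 1-D Gaussian task. The constants `D_STEPS_PER_G_STEP`, `LR_D`, and `LR_G` are illustrative assumptions, not ParaGAN's actual settings.

```python
# Illustrative sketch only: the abstract does not specify ParaGAN's policy,
# so this shows a generic asymmetric GAN optimization scheme on toy data.
import torch
import torch.nn as nn

LATENT_DIM = 8
D_STEPS_PER_G_STEP = 2          # hypothetical asymmetry in update frequency
LR_D, LR_G = 4e-4, 1e-4         # hypothetical asymmetry in learning rates

G = nn.Sequential(nn.Linear(LATENT_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

opt_d = torch.optim.Adam(D.parameters(), lr=LR_D, betas=(0.0, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=LR_G, betas=(0.0, 0.999))
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Toy "real" data: samples from N(3, 0.5).
    return 3.0 + 0.5 * torch.randn(n, 1)

for step in range(200):
    # Discriminator updates (run more often than the generator's).
    for _ in range(D_STEPS_PER_G_STEP):
        real = real_batch()
        fake = G(torch.randn(real.size(0), LATENT_DIM)).detach()
        loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
                 bce(D(fake), torch.zeros(fake.size(0), 1))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update against the current discriminator.
    fake = G(torch.randn(64, LATENT_DIM))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```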