Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme built on two key insights: which layers are most important to update is architecture-dependent and can be determined a priori, and dynamic stochastic channel selection yields a better gradient approximation than static selection. Concretely, we introduce a dynamic channel selection approach that stochastically resamples channels between epochs within the preselected layers. Extensive experiments demonstrate that TraDy achieves state-of-the-art performance across a variety of downstream tasks and architectures under strict memory constraints, reaching up to 99% activation sparsity, 95% weight-derivative sparsity, and a 97% reduction in FLOPs for weight-derivative computation.
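To make the dynamic selection concrete, the following is a minimal PyTorch-style sketch (not the authors' implementation) of resampling, once per epoch, the output channels whose weight gradients are kept within a preselected set of layers; the layer names, keep ratio, and gradient-hook masking mechanism are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' code) of dynamic stochastic
# channel selection: within preselected layers, a random subset of output
# channels is resampled every epoch, and only those channels' weight
# gradients are kept. Layer names and the keep ratio are assumptions.
import torch
import torch.nn as nn


def make_channel_mask_hook(mask: torch.Tensor):
    """Return a gradient hook that zeroes weight gradients of unselected channels."""
    def hook(grad: torch.Tensor) -> torch.Tensor:
        # grad has shape (out_channels, ...) for Conv2d/Linear weights
        shape = [-1] + [1] * (grad.dim() - 1)
        return grad * mask.view(shape).to(grad.dtype)
    return hook


def resample_channels(model: nn.Module, selected_layers, keep_ratio: float):
    """Stochastically resample which output channels receive weight updates.

    Intended to be called once per epoch; remove the returned hook handles
    before the next resampling.
    """
    handles = []
    for name, module in model.named_modules():
        if name not in selected_layers:
            continue
        out_channels = module.weight.shape[0]
        n_keep = max(1, int(keep_ratio * out_channels))
        keep = torch.randperm(out_channels)[:n_keep]
        mask = torch.zeros(out_channels)
        mask[keep] = 1.0
        handles.append(module.weight.register_hook(make_channel_mask_hook(mask)))
    return handles


# Usage: layers are preselected a priori (architecture-dependent); channels
# are resampled at the start of each epoch and hooks removed afterwards.
if __name__ == "__main__":
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
    selected = {"0", "2"}  # hypothetical preselected layer names
    for epoch in range(3):
        handles = resample_channels(model, selected, keep_ratio=0.25)
        x = torch.randn(2, 3, 32, 32)
        model(x).sum().backward()
        for h in handles:
            h.remove()
        model.zero_grad()
```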