Training deep neural networks (DNNs) in low-dimensional subspaces is a promising direction for achieving efficient training and better generalization. Previous works extract the subspaces by random projection or by applying dimensionality reduction to the training trajectory, but these methods can be inefficient in terms of dimensionality or numerically unstable. In this paper, we connect subspace training to weight averaging and propose Trainable Weight Averaging (TWA), a general approach to subspace training that generalizes these previous efforts. TWA is efficient in terms of dimensionality and easy to use. We further design an efficient scheme for subspace training on large-scale problems, which supports parallel training across multiple nodes and distributes the memory and computation burden evenly across the nodes. We apply TWA to two tasks, efficient neural network training and improving fine-tuning performance, to demonstrate the efficiency and effectiveness of our approach. We conduct extensive experiments covering benchmark computer vision and natural language processing tasks with various architectures. The implementation code is available at https://github.com/nblt/TWA.
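To make the link between weight averaging and subspace training concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the repository above for that). It assumes a toy linear regression model, treats a few noisy weight vectors as hypothetical "checkpoints", and trains the averaging coefficients by gradient descent instead of fixing them to 1/k. The names `train_averaging_coefficients` and `mse_loss_and_grad` are illustrative only.

```python
import numpy as np

def mse_loss_and_grad(w, X, y):
    """Mean squared error of a linear model and its gradient w.r.t. w."""
    residual = X @ w - y
    loss = float(np.mean(residual ** 2))
    grad = 2.0 * X.T @ residual / len(y)
    return loss, grad

def train_averaging_coefficients(checkpoints, X, y, lr, steps=200):
    """Optimize coefficients alpha so that w = checkpoints.T @ alpha.

    checkpoints: array of shape (k, d), one flattened weight vector per row.
    Starts from uniform averaging (alpha_i = 1/k) and runs plain gradient
    descent in the k-dimensional coefficient space.
    """
    k = checkpoints.shape[0]
    alpha = np.full(k, 1.0 / k)
    for _ in range(steps):
        w = checkpoints.T @ alpha               # weights as a weighted average
        _, grad_w = mse_loss_and_grad(w, X, y)
        grad_alpha = checkpoints @ grad_w       # chain rule: dL/dalpha = C dL/dw
        alpha -= lr * grad_alpha
    return alpha

# Toy data (assumption: a noiseless linear regression problem).
rng = np.random.default_rng(0)
d, k, n = 20, 4, 200
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

# Hypothetical "checkpoints": noisy copies of the target weights.
checkpoints = w_true + 0.5 * rng.normal(size=(k, d))

# Safe step size for this quadratic toy problem: 1 / (largest Hessian eigenvalue).
hessian = 2.0 * checkpoints @ X.T @ X @ checkpoints.T / n
lr = 1.0 / np.linalg.eigvalsh(hessian).max()

uniform_loss, _ = mse_loss_and_grad(checkpoints.mean(axis=0), X, y)
alpha = train_averaging_coefficients(checkpoints, X, y, lr)
trained_loss, _ = mse_loss_and_grad(checkpoints.T @ alpha, X, y)

print(f"uniform averaging loss: {uniform_loss:.4f}")
print(f"trained averaging loss: {trained_loss:.4f}")
```

Only k coefficients are optimized here rather than all d weights, which is the sense in which trainable averaging amounts to training in a low-dimensional subspace spanned by the checkpoints. The step-size rule is specific to this quadratic toy problem and would not carry over to a real DNN.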