Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, so training these models requires extensive resources. In this work, we present a novel approach for compressing overparameterized models, developed by studying their learning dynamics. We observe that for many deep models, updates to the weight matrices occur within a low-dimensional invariant subspace. For deep linear models, we demonstrate that their principal components are fitted incrementally within a small subspace, and we use these insights to propose a compression algorithm for deep linear networks that decreases the width of their intermediate layers. We empirically evaluate the effectiveness of our compression technique on matrix recovery problems. Remarkably, by using an initialization that exploits the structure of the problem, we observe that our compressed network converges faster than the original network, consistently yielding smaller recovery errors. We substantiate this observation by developing a theory focused on deep matrix factorization. Finally, we empirically demonstrate how our compressed model has the potential to improve the utility of deep nonlinear models. Overall, our algorithm improves the training efficiency by more than 2x, without compromising generalization.
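As a rough illustration of the idea of narrowing the intermediate layers of a deep linear network, the following minimal sketch fits a three-layer linear factorization of width k to a rank-r target matrix. It is not the algorithm of this work: the fully observed target, the sizes (d = 20, r = 2, k = 4), the step size, and the SVD-based initialization in `spectral_init` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import numpy as np


def make_low_rank_target(d, r, rng):
    """Build a d x d matrix of exact rank r with singular values in [2, 3]."""
    U, _ = np.linalg.qr(rng.standard_normal((d, r)))
    V, _ = np.linalg.qr(rng.standard_normal((d, r)))
    s = np.linspace(3.0, 2.0, r)
    return U @ np.diag(s) @ V.T


def spectral_init(M, k, eps):
    """Assumed initialization: small factors aligned with the top-k singular
    subspaces of the target (an illustrative choice, not the paper's rule)."""
    U, _, Vt = np.linalg.svd(M)
    W1 = eps * Vt[:k, :]   # k x d, input layer
    W2 = eps * np.eye(k)   # k x k, narrow intermediate layer
    W3 = eps * U[:, :k]    # d x k, output layer
    return W1, W2, W3


def train(M, W1, W2, W3, lr, steps):
    """Plain gradient descent on 0.5 * ||W3 W2 W1 - M||_F^2."""
    for _ in range(steps):
        R = W3 @ W2 @ W1 - M          # residual, d x d
        g3 = R @ (W2 @ W1).T          # gradient w.r.t. W3, d x k
        g2 = W3.T @ R @ W1.T          # gradient w.r.t. W2, k x k
        g1 = (W3 @ W2).T @ R          # gradient w.r.t. W1, k x d
        W1, W2, W3 = W1 - lr * g1, W2 - lr * g2, W3 - lr * g3
    return W1, W2, W3


def relative_error(M, W1, W2, W3):
    """Frobenius-norm recovery error relative to the target."""
    return np.linalg.norm(W3 @ W2 @ W1 - M) / np.linalg.norm(M)


def count_params(*weights):
    """Total number of scalar parameters across the given layers."""
    return sum(W.size for W in weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, r, k = 20, 2, 4                # k = 2r is an assumed width
    M = make_low_rank_target(d, r, rng)
    W1, W2, W3 = spectral_init(M, k, eps=0.1)
    W1, W2, W3 = train(M, W1, W2, W3, lr=0.05, steps=1000)
    print("relative recovery error:", relative_error(M, W1, W2, W3))
    print("compressed parameters:", count_params(W1, W2, W3))
    print("full-width parameters:", 3 * d * d)
```

With these assumed sizes, the narrow network stores 2dk + k^2 parameters instead of the 3d^2 of a full-width three-layer network, which is where the savings in memory and per-step computation would come from in this toy setting.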