We introduce a novel optimization problem formulation that departs from the conventional approach of minimizing machine learning model loss as a black-box function. Unlike traditional formulations, the proposed approach explicitly incorporates a pre-trained initial model and random sketch operators, allowing both the model and the gradient to be sparsified during training. We establish key properties of the proposed objective function and highlight its connections to the standard formulation. Furthermore, we present several variants of the Stochastic Gradient Descent (SGD) method adapted to the new problem formulation, including SGD with general sampling, a distributed version, and SGD with variance reduction. We obtain tighter convergence rates and relax standard assumptions, bridging the gap between theoretical principles and practical applications and covering important techniques such as Dropout and sparse training. This work opens promising opportunities for improving the theoretical understanding of model training through a sparsification-aware optimization approach.
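To make the setting concrete, the following is a minimal, illustrative sketch (not the paper's exact formulation): it assumes the sketch operator is a Dropout-style random diagonal 0/1 mask, the starting point stands in for a pre-trained model, and a toy least-squares loss replaces the actual training objective. At each step the mask sparsifies the model before the gradient is evaluated, and the same mask sparsifies the resulting gradient update.

```python
import numpy as np

# Hypothetical toy problem: 0.5 * ||A w - b||^2 stands in for the training loss.
rng = np.random.default_rng(0)
d = 10
A = rng.standard_normal((50, d))
b = rng.standard_normal(50)

def loss_grad(w):
    """Gradient of the illustrative least-squares loss."""
    return A.T @ (A @ w - b)

def sketched_sgd(w0, steps=500, lr=1e-3, p=0.5):
    """SGD on the sketched model: at step k, draw a random diagonal 0/1 mask
    (the sketch operator), evaluate the gradient at the sparsified point, and
    update only the coordinates the mask keeps, so the gradient is sparsified too."""
    w = w0.copy()
    for _ in range(steps):
        mask = rng.random(d) < p      # random sketch operator (Dropout-style mask)
        g = loss_grad(mask * w)       # gradient at the sparsified model
        w -= lr * (mask * g)          # sparsified (sketched) gradient step
    return w

w0 = rng.standard_normal(d)           # stands in for a pre-trained initial model
w_final = sketched_sgd(w0)
print("final loss:", 0.5 * np.linalg.norm(A @ w_final - b) ** 2)
```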