As models grow larger and more complex, achieving better off-sample generalization with minimal trial and error is critical to the reliability and economy of machine learning workflows. Gradient regularization is a natural proxy for the well-studied heuristic of seeking "flat" local minima, and first-order approximations such as Flooding and sharpness-aware minimization (SAM) have received significant attention. Their performance, however, depends critically on hyperparameters (the flood threshold and the neighborhood radius, respectively) that are non-trivial to specify in advance. To develop a procedure that is more resilient to misspecified hyperparameters, we take the hard-threshold "ascent-descent" switching device used in Flooding as motivation and propose a softened, pointwise mechanism called SoftAD that downweights points on the borderline, limits the effects of outliers, and retains the ascent-descent effect. We contrast the formal stationarity guarantees for SoftAD with those for Flooding, and empirically demonstrate that SoftAD can realize classification accuracy competitive with SAM and Flooding while maintaining a much smaller loss generalization gap and model norm. Our empirical tests range from simple binary classification on the plane to image classification using neural networks with millions of parameters; the key trends are observed across all datasets and models studied, and suggest a potential new approach to implicit regularization.
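To make the contrast between hard and soft ascent-descent switching concrete, the following is a minimal NumPy sketch, not the authors' implementation. The Flooding objective is written in its commonly stated form, |mean loss − b| + b, for flood threshold b. The abstract does not give SoftAD's formula, so the soft variant here is an assumption chosen only to exhibit the three stated properties: it applies the smooth function ρ(u) = sqrt(u² + 1) − 1 to each point's shifted loss, and the names `soft_ad_objective`, `soft_ad_weights`, and `scale` are hypothetical.

```python
import numpy as np


def flooding_objective(losses, threshold):
    """Flooding: hard switch applied to the *average* loss."""
    return abs(np.mean(losses) - threshold) + threshold


def flooding_weights(losses, threshold):
    """Per-point weight on each loss gradient under Flooding.

    Every point gets the same weight: +1/n (descent) when the average
    loss is above the threshold, -1/n (ascent) when it is below.
    """
    n = losses.size
    return np.full(n, np.sign(np.mean(losses) - threshold) / n)


def soft_ad_objective(losses, threshold, scale=1.0):
    """Assumed soft, pointwise variant (illustrative, not from the source).

    Applies rho(u) = sqrt(u^2 + 1) - 1 to each shifted, rescaled loss.
    """
    u = (losses - threshold) / scale
    return scale * np.mean(np.sqrt(u**2 + 1.0) - 1.0)


def soft_ad_weights(losses, threshold, scale=1.0):
    """Per-point weight on each loss gradient: rho'(u) / n.

    rho'(u) = u / sqrt(u^2 + 1) is near zero for borderline points,
    saturates below 1 in magnitude for outliers, and is negative
    (ascent) for points whose loss is below the threshold.
    """
    u = (losses - threshold) / scale
    return u / np.sqrt(u**2 + 1.0) / losses.size


# Hypothetical per-example losses: one below the threshold, one exactly
# on it, one moderately above, and one outlier.
losses = np.array([0.0, 0.1, 0.5, 100.0])
threshold = 0.1

hard = flooding_weights(losses, threshold)
soft = soft_ad_weights(losses, threshold)
print("Flooding weights:", hard)
print("Soft weights:    ", soft)
```

In this toy example the single outlier pulls the average loss above the threshold, so Flooding descends on every point with equal weight. The assumed soft variant instead ascends on the first point, ignores the borderline point, and caps the outlier's weight below 1/n.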