Generalization properties are a central aspect of the design and analysis of learning algorithms. One notion that many previous works have linked to good generalization is flat minima, which informally describes a loss surface that is insensitive to noise perturbations. However, the design of efficient, easy-to-analyze algorithms for finding flat minima is relatively under-explored. In this paper, we propose a new algorithm to address this issue, which minimizes a stochastic optimization objective that averages noise perturbations injected into the weights of a function. This algorithm is shown to enjoy both theoretical and empirical advantages compared to existing algorithms involving worst-case perturbations. Theoretically, we show tight convergence rates for our algorithm in finding first-order stationary points of the stochastic objective. Empirically, the algorithm induces a penalty on the trace of the Hessian, leading to iterates that are flatter than those of SGD and other alternatives, with tighter generalization gaps. Altogether, this work contributes a provable and practical algorithm to find flat minima by optimizing the noise stability properties of a function.
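To make the link between averaging weight perturbations and a penalty on the Hessian trace concrete, the following is a minimal sketch on a toy quadratic loss. It is not the algorithm proposed in the paper. It assumes isotropic Gaussian noise with standard deviation `sigma` (the abstract does not specify the noise distribution), and the names `quadratic_loss` and `noise_averaged_loss` are hypothetical. For a quadratic loss with Hessian `A`, the noise-averaged objective equals the plain loss plus `0.5 * sigma**2 * trace(A)`, which the Monte Carlo estimate should reproduce approximately.

```python
import numpy as np


def quadratic_loss(w, A):
    """Toy loss f(w) = 0.5 * w^T A w, whose Hessian is A everywhere."""
    return 0.5 * w @ A @ w


def noise_averaged_loss(loss, w, sigma, n_samples, rng):
    """Monte Carlo estimate of E_u[loss(w + u)] with u ~ N(0, sigma^2 I).

    Each sample is evaluated at +u and -u (an antithetic pair), which
    cancels the first-order term and reduces the estimator's variance.
    Gaussian noise is an assumption made for this illustration.
    """
    total = 0.0
    for _ in range(n_samples):
        u = sigma * rng.standard_normal(w.shape)
        total += 0.5 * (loss(w + u) + loss(w - u))
    return total / n_samples


rng = np.random.default_rng(0)
A = np.diag([4.0, 1.0, 0.25])      # Hessian of the toy loss
w = np.array([1.0, -2.0, 0.5])     # arbitrary point in weight space
sigma = 0.1                        # noise scale (illustrative value)

plain = quadratic_loss(w, A)
smoothed = noise_averaged_loss(lambda v: quadratic_loss(v, A), w, sigma, 20000, rng)
predicted = plain + 0.5 * sigma**2 * np.trace(A)

print("plain loss:          ", plain)
print("noise-averaged loss: ", smoothed)
print("loss + trace penalty:", predicted)
```

For non-quadratic losses the same identity holds only up to higher-order terms in `sigma`, so the trace term should be read as the leading-order effect of the averaging rather than an exact penalty.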