Curvature in the Looking-Glass: Optimal Methods to Exploit Curvature of Expectation in the Loss Landscape

Harnessing the local topography of the loss landscape is a central challenge in advanced optimization tasks. By accounting for the effect of potential parameter changes, we can alter the model more efficiently. Contrary to standard assumptions, we find that the Hessian does not always approximate loss curvature well, particularly near gradient discontinuities, which commonly arise in deep learning architectures. We present a new conceptual framework to understand how curvature of expected changes in loss emerges in architectures with many rectified linear units. Each ReLU creates a parameter boundary that, when crossed, induces a pseudorandom gradient perturbation. Our derivations show how these discontinuities combine to form a glass-like structure, similar to amorphous solids that contain microscopic domains of strong, but random, atomic alignment. By estimating the density of the resulting gradient variations, we can bound how the loss may change with parameter movement. Our analysis includes the optimal kernel and sample distribution for approximating glass density from ordinary gradient evaluations. We also derive the optimal modification to quasi-Newton steps that incorporate both glass and Hessian terms, as well as certain exactness properties that are possible with Nesterov-accelerated gradient updates. Our algorithm, Alice, tests these techniques to determine which curvature terms are most impactful for training a given architecture and dataset. Additional safeguards enforce stable exploitation through step bounds that expand on the functionality of Adam. These theoretical and experimental tools lay groundwork to improve future efforts (e.g., pruning and quantization) by providing new insight into the loss landscape.
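As a rough illustration of the claim that the Hessian can miss curvature near ReLU boundaries, the following one-dimensional toy may help. It is a sketch under stated assumptions, not the paper's glass-density estimator and not the Alice algorithm: the loss form, the evenly spaced boundaries `kinks`, the random slope changes `coeffs`, and all function names are hypothetical choices made only for this example.

```python
import numpy as np

# Hypothetical toy loss: L(w) = sum_i c_i * relu(w - k_i).
# Each k_i plays the role of one ReLU boundary in parameter space.
n = 1000
kinks = (np.arange(n) + 0.5) / n      # boundaries evenly spaced in (0, 1)
rng = np.random.default_rng(0)
coeffs = rng.standard_normal(n)       # random gradient jump at each boundary


def loss(w):
    """Piecewise-linear loss built from n rectified linear units."""
    return np.sum(coeffs * np.maximum(w - kinks, 0.0))


def grad(w):
    """Exact derivative away from the boundaries: sum of active slopes."""
    return np.sum(coeffs[w > kinks])


def local_curvature(w, eps=1e-6):
    """Central difference of the gradient, a stand-in for the Hessian."""
    return (grad(w + eps) - grad(w - eps)) / (2.0 * eps)


# Between boundaries the loss is linear, so the local curvature vanishes.
print("local curvature at w=0.5:", local_curvature(0.5))

# Over a finite step the gradient still changes, by the sum of the jumps
# at every boundary crossed along the way.
w0, w1 = 0.25, 0.75
crossed = (kinks > w0) & (kinks < w1)
grad_change = grad(w1) - grad(w0)
print("boundaries crossed:", int(crossed.sum()))
print("gradient change over the step:", grad_change)
print("sum of crossed jumps:", coeffs[crossed].sum())
```

In this toy the pointwise second derivative is zero wherever it is defined, yet the gradient drifts over a finite step by the accumulated random jumps. The size of that drift depends on how densely the boundaries are packed, which is the quantity the abstract refers to as the density of gradient variations.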
