In this position paper, we argue that many post-mortem generalization measures -- those computed on trained networks -- are \textbf{fragile}: small training modifications that barely affect the performance of the underlying deep neural network can substantially change a measure's value, trend, or scaling behavior. For example, minor hyperparameter changes, such as learning-rate adjustments or switching between SGD variants, can reverse the slope of the learning curve of widely used generalization measures such as the path norm. We also identify subtler forms of fragility. For instance, the PAC-Bayes origin measure is regarded as one of the most reliable, and is indeed less sensitive to hyperparameter tweaks than many other measures. However, it completely fails to capture differences in data complexity across learning curves. This data fragility contrasts with the function-based marginal-likelihood PAC-Bayes bound, which does capture differences in data complexity across learning curves, including their scaling behavior, but which is not a post-mortem measure. Beyond demonstrating that many post-mortem measures are fragile, this position paper also argues that developers of new measures should explicitly audit them for fragility.
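For concreteness, the path norm mentioned above admits a compact definition; the following is one standard formulation, sketched under the assumption of a bias-free ReLU network with weight matrices $W^{(1)},\dots,W^{(L)}$ (notation ours):
\[
\mu_{\mathrm{path}}(f) \;=\; \Bigg( \sum_{p \in \mathcal{P}} \; \prod_{\ell=1}^{L} \Big( W^{(\ell)}_{p_{\ell},\, p_{\ell-1}} \Big)^{2} \Bigg)^{1/2},
\]
where $\mathcal{P}$ ranges over all input-to-output paths through the network and $p_{\ell}$ indexes the unit the path visits at layer $\ell$. In practice it can be computed with a single forward pass: square every weight, propagate the all-ones input, and take the square root of the summed outputs.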