Trustworthy machine learning requires careful control over a model's reliance on non-robust features. We propose a framework that identifies and regulates such features by attributing model predictions to the input. Within our approach, attributions of robust features remain consistent, whereas attributions of non-robust features are prone to fluctuation. This behavior reveals a correlation between a model's reliance on non-robust features and the smoothness of the marginal density of the input samples. We therefore regularize the gradients of the marginal density with respect to the input features to improve robustness. We also devise an efficient implementation of this regularization that addresses the potential numerical instability of the underlying optimization. Moreover, we show analytically that, unlike our marginal density smoothing, the prevalent input gradient regularization smooths the conditional or joint density of the input, which can limit robustness. Our experiments validate the effectiveness of the proposed method, providing clear evidence that it addresses the feature leakage problem and mitigates spurious correlations. Extensive results further establish that our technique makes the model robust against perturbations in pixel values, input gradients, and density.
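To make the distinction between smoothing the marginal density and smoothing the conditional density concrete, the following is a minimal sketch and not the implementation described above. It assumes an energy-based reading of a classifier, in which the unnormalized log marginal density is the log-sum-exp of the logits, log p(x) = logsumexp_y f_y(x) + const. The abstract does not state this interpretation. The linear classifier and all function names are hypothetical and chosen only so that the gradients have closed forms.

```python
import math

# Toy linear classifier: logits f(x) = W x + b (hypothetical, for illustration only).
def logits(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def log_marginal(W, b, x):
    # ASSUMPTION: unnormalized log p(x) = logsumexp_y f_y(x) (energy-based view).
    z = logits(W, b, x)
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def grad_log_marginal(W, b, x):
    # d/dx logsumexp(Wx + b) = W^T softmax(Wx + b)
    p = softmax(logits(W, b, x))
    return [sum(p[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]

def grad_log_conditional(W, b, x, y):
    # d/dx [f_y(x) - logsumexp f(x)] = W_y - W^T softmax(Wx + b)
    g = grad_log_marginal(W, b, x)
    return [W[y][j] - g[j] for j in range(len(x))]

def sq_norm(v):
    return sum(t * t for t in v)

def marginal_penalty(W, b, x):
    # Squared norm of the input gradient of the (unnormalized) log marginal density.
    return sq_norm(grad_log_marginal(W, b, x))

def conditional_penalty(W, b, x, y):
    # Squared norm of the input gradient of log p(y|x), i.e. of the negative
    # cross-entropy loss, as in standard input gradient regularization.
    return sq_norm(grad_log_conditional(W, b, x, y))
```

In this toy setting, a classifier whose two weight rows are both `[1.0, 2.0]` has a conditional gradient of exactly zero, while its marginal gradient is `[1.0, 2.0]`. Under the stated assumption, a penalty on the conditional gradient alone would therefore leave the marginal density gradient unconstrained. With a deep network, the same quantities would be obtained by automatic differentiation rather than closed forms.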