For machine learning models to be reliable and trustworthy, their decisions must be interpretable. As these models find increasing use in safety-critical applications, it is important that not just the model predictions but also their explanations (in the form of feature attributions) be robust to small, human-imperceptible input perturbations. Recent works have shown that many attribution methods are fragile and have proposed improvements to either these methods or model training. We observe two main causes of fragile attributions: first, the existing metrics of robustness (e.g., top-k intersection) over-penalize even reasonable local shifts in attribution, thereby making random perturbations appear to be strong attacks; and second, the attribution can be concentrated in a small region even when there are multiple important parts in an image. To rectify this, we propose simple ways to strengthen existing metrics and attribution methods by incorporating locality of pixels into robustness metrics and diversity of pixel locations into attributions. Regarding the role of model training in attributional robustness, we empirically observe that adversarially trained models have more robust attributions on smaller datasets; however, this advantage disappears on larger datasets. Code is available at https://github.com/ksandeshk/LENS.
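To illustrate why top-k intersection can over-penalize small local shifts, the following is a minimal sketch rather than the authors' implementation. The function names, the window parameter `w`, and the specific locality rule (a top-k pixel counts as matched if any top-k pixel of the other map lies within `w` pixels in each direction) are assumptions made here for illustration; the definitions actually used in this work may differ.

```python
import numpy as np

def top_k_mask(attr, k):
    """Boolean mask of the k largest-magnitude entries of a 2-D attribution map."""
    flat = np.abs(attr).ravel()
    idx = np.argpartition(flat, -k)[-k:]  # indices of the k largest values
    mask = np.zeros(flat.shape, dtype=bool)
    mask[idx] = True
    return mask.reshape(attr.shape)

def top_k_intersection(attr_a, attr_b, k):
    """Fraction of top-k pixels shared exactly by the two attribution maps."""
    a, b = top_k_mask(attr_a, k), top_k_mask(attr_b, k)
    return np.logical_and(a, b).sum() / k

def dilate(mask, w):
    """Mark every pixel within w rows and w columns of a True pixel."""
    h, wd = mask.shape
    padded = np.pad(mask, w, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * w + 1):
        for dx in range(2 * w + 1):
            out |= padded[dy:dy + h, dx:dx + wd]
    return out

def local_top_k_intersection(attr_a, attr_b, k, w=1):
    """Hypothetical locality-aware variant: a top-k pixel of attr_a is matched
    if some top-k pixel of attr_b falls inside its (2w+1) x (2w+1) window."""
    a = top_k_mask(attr_a, k)
    b_near = dilate(top_k_mask(attr_b, k), w)
    return np.logical_and(a, b_near).sum() / k

# Toy example: the second map is the first shifted right by one pixel.
original = np.zeros((8, 8))
original[2, 2], original[2, 3], original[5, 5] = 3.0, 2.0, 1.0
shifted = np.roll(original, 1, axis=1)

strict = top_k_intersection(original, shifted, k=3)
local = local_top_k_intersection(original, shifted, k=3, w=1)
print(strict, local)
```

In this toy case a one-pixel shift leaves the strict score at one third, while the windowed variant treats all three pixels as matched. The window size is a free parameter in this sketch: a larger `w` tolerates larger shifts but also makes the metric easier to satisfy.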