Neural network training tends to exploit the simplest features as shortcuts to greedily minimize training loss. However, some of these features may be spuriously correlated with the target labels, leading to incorrect model predictions. Several methods have been proposed to address this issue. Because they focus on suppressing spurious correlations during model training, they not only incur additional training cost but also have limited practical utility, since model misbehavior caused by spurious correlations is usually discovered only after deployment. It is also often overlooked that spuriousness is a subjective notion. Hence, the precise questions that must be investigated are: to what degree a feature is spurious, and how the model's attention can be proportionally diverted from it for reliable prediction. To this end, we propose a method that enables post-hoc neutralization of spurious feature impact, controllable to an arbitrary degree. We conceptualize spurious features as fictitious sub-classes within the original classes, which can be eliminated by a class removal scheme. We then propose a precise class removal technique that employs a single-weight modification and entails negligible performance compromise for the remaining classes. Extensive experiments demonstrate that by editing just a single weight in a post-hoc manner, our method achieves performance that is highly competitive with, or better than, state-of-the-art methods.
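To make the core idea concrete, here is a minimal sketch of post-hoc, controllable suppression of one class's (or fictitious sub-class's) logit in a linear classifier head. The single edited scalar is modeled as an additive penalty `lam` on the target logit (`lam = 0` leaves the model unchanged; large `lam` fully removes the class); the names `gated_logits`, `lam`, and the choice of which weight to edit are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # toy head: 4 features -> 3 classes
x = rng.normal(size=4)        # one example's feature vector

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gated_logits(W, x, target, lam):
    """Edit a single scalar: subtract penalty `lam` from one class logit.

    lam = 0   -> original model behavior
    lam -> oo -> target class fully neutralized
    (hypothetical gate for illustration; the actual edited weight may differ)
    """
    z = x @ W
    z[target] -= lam
    return z

p_full = softmax(gated_logits(W, x, target=2, lam=0.0))
p_removed = softmax(gated_logits(W, x, target=2, lam=1e3))
# p_removed[2] is driven to ~0, while the relative odds of the
# remaining classes are preserved (negligible compromise for them).
```

Note that because softmax ratios between untouched classes depend only on their own logits, this single-scalar edit leaves the remaining classes' relative probabilities intact, which mirrors the "negligible performance compromise" property claimed above.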