We present a simple but effective method to measure and mitigate model biases caused by reliance on spurious cues. Instead of requiring costly changes to one's data or model training, our method makes better use of the data one already has by sorting it. Specifically, we rank images within their classes by spuriosity (the degree to which common spurious cues are present), proxied via deep neural features of an interpretable network. With spuriosity rankings, it is easy to identify minority subpopulations (i.e., low-spuriosity images) and to assess model bias as the gap in accuracy between high- and low-spuriosity images. One can even efficiently remove a model's bias at little cost to accuracy by finetuning its classification head on low-spuriosity images, resulting in fairer treatment of samples regardless of spuriosity. We demonstrate our method on ImageNet, annotating $5000$ class-feature dependencies ($630$ of which we find to be spurious) and generating a dataset of $325k$ soft segmentations for these features along the way. Having computed spuriosity rankings via the identified spurious neural features, we assess biases for $89$ diverse models and find that class-wise biases are highly correlated across models. Our results suggest that model bias due to spurious feature reliance is influenced far more by what the model is trained on than by how it is trained.
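As a rough illustration of the bias measure described above, the following minimal sketch ranks images within each class by a spuriosity score and reports the accuracy gap between the most and least spurious images. It assumes the per-image spuriosity scores have already been computed (in the paper they come from neural features of an interpretable network; here they are toy numbers). The function name `spuriosity_gap`, the cutoff `k`, and the toy arrays are hypothetical choices made for this example, not part of the original method's specification.

```python
import numpy as np

def spuriosity_gap(spuriosity, correct, labels, k=2):
    """Per-class accuracy gap: top-k spuriosity minus bottom-k spuriosity.

    spuriosity: per-image scores (assumed precomputed; hypothetical input)
    correct:    1.0 where the model classified the image correctly, else 0.0
    labels:     class index of each image
    k:          images taken from each end of the ranking (an assumption)
    """
    gaps = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # Sort this class's images from least to most spurious.
        order = idx[np.argsort(spuriosity[idx], kind="stable")]
        low, high = order[:k], order[-k:]
        gaps[int(c)] = float(correct[high].mean() - correct[low].mean())
    return gaps

# Toy data: two classes with four images each.
spuriosity = np.array([0.9, 0.1, 0.8, 0.2, 0.5, 0.4, 0.7, 0.3])
correct = np.array([1, 0, 1, 1, 1, 1, 0, 1], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

gaps = spuriosity_gap(spuriosity, correct, labels, k=2)
print(gaps)
```

A positive gap for a class indicates that the model is more accurate when the spurious cue is present, which is the kind of bias the ranking is meant to expose. A negative gap means the model does worse on high-spuriosity images of that class.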