Bias Leaves a Gradient Trail: Label-Free Bias Identification via Gradient Probes on Concept Decompositions

Vision classifiers can exploit spurious correlations, achieving high in-distribution accuracy yet failing under distribution shift. Existing approaches to bias mitigation and analysis often depend on curated datasets, spurious-attribute or group labels, or retraining, which may be infeasible once a model is deployed or when the relevant bias is unknown. We present a bias-label-free, post-hoc method for identifying spurious concepts in frozen vision models that relies only on standard class labels from a held-out audit dataset. For each target class, we collect patches from inputs predicted as that class and apply non-negative matrix factorization to their intermediate activations to obtain a bank of interpretable concept vectors. Candidate concepts are then ranked with a bias estimator derived from their interaction with backpropagated gradients on misclassified examples: bias concepts tend to be activated when correcting false negatives and suppressed when correcting false positives. On Colored MNIST and Waterbirds, the method recovers concepts aligned with the known spurious cue, and on CelebA it surfaces decision-relevant directions that only partially coincide with the annotated gender attribute. Suppressing the top-ranked concepts at inference time improves worst-group accuracy by up to 17.9 percentage points on Waterbirds and 10.4 on CelebA without any retraining or parameter updates. Because the directions it identifies need not coincide with annotated attributes, the method provides both an interpretable auditing tool and an actionable debiasing handle for frozen vision models. Code is available at https://github.com/vitryt/label-free-bias-identification.
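
The pipeline described above (concept decomposition, gradient-based ranking, inference-time suppression) can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names `nmf`, `bias_scores`, and `suppress` are hypothetical, the NMF uses generic multiplicative updates, and the scoring rule (mean alignment of the loss-descent direction with each concept on false negatives, minus the same quantity on false positives) is only one plausible reading of the estimator summarized in the abstract. Activations and gradients are assumed to be already extracted from a frozen model as arrays of shape (samples, features).

```python
import numpy as np


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative V (n, d) as U @ C with U (n, k) and C (k, d).

    Uses standard multiplicative updates for the Frobenius objective;
    the rows of C play the role of concept vectors.
    """
    rng = np.random.default_rng(seed)
    n, d = V.shape
    U = rng.random((n, k)) + eps
    C = rng.random((k, d)) + eps
    for _ in range(iters):
        C *= (U.T @ V) / (U.T @ U @ C + eps)
        U *= (V @ C.T) / (U @ C @ C.T + eps)
    return U, C


def bias_scores(C, grads_fn, grads_fp):
    """Score each concept (row of C) for one target class.

    grads_fn / grads_fp: (m, d) gradients of the loss w.r.t. the
    activations, for false negatives / false positives of that class.
    The descent direction -grad is the activation change that would
    correct the error. ASSUMPTION: a bias concept is pushed up on false
    negatives and pushed down on false positives, so the score is the
    difference of the two mean alignments.
    """
    C_unit = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-12)
    push_fn = (-grads_fn) @ C_unit.T  # (m_fn, k)
    push_fp = (-grads_fp) @ C_unit.T  # (m_fp, k)
    return push_fn.mean(axis=0) - push_fp.mean(axis=0)


def suppress(acts, C, idx):
    """Project the concept C[idx] out of activations acts (n, d)."""
    c = C[idx] / (np.linalg.norm(C[idx]) + 1e-12)
    return acts - np.outer(acts @ c, c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.random((50, 8))            # stand-in for patch activations
    _, concepts = nmf(acts, k=3)
    g_fn = rng.normal(size=(10, 8))       # stand-in gradients (false negatives)
    g_fp = rng.normal(size=(10, 8))       # stand-in gradients (false positives)
    ranking = np.argsort(-bias_scores(concepts, g_fn, g_fp))
    cleaned = suppress(acts, concepts, ranking[0])
    print(ranking, cleaned.shape)
```

In this sketch, suppression is a plain orthogonal projection, so no model parameters change; whether the actual method clips, rescales, or edits activations differently is not stated in the abstract.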
