Standard fairness audits of foundation models quantify how biased a model is, but not where inside the network the bias resides. We propose a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers. As a feasibility case study, we apply this pipeline to the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark, auditing both gender and age bias. For gender, the pipeline identifies four terminal-layer heads whose ablation reduces global bias (Cramér's V: 0.381 -> 0.362) while marginally improving accuracy (+0.42%); a layer-matched random control confirms that this effect is specific to the identified heads. A single head in the final layer accounts for the majority of the reduction in the most stereotyped classes, and class-level analysis shows that corrected predictions shift toward the correct occupation. For age, the same pipeline identifies candidate heads, but ablation produces weaker and less consistent effects, suggesting that age bias is encoded more diffusely than gender bias in this model. These results provide preliminary evidence that head-level bias localisation is feasible for discriminative vision encoders and that the degree of localisability may vary across protected attributes.

Keywords: Bias · CLIP · Mechanistic Interpretability · Vision Transformer · Fairness
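To make the reported bias metric concrete, the following is a minimal sketch (not the authors' code) of Cramér's V computed over a contingency table of annotated gender versus predicted profession; the label arrays in the usage example are hypothetical.

```python
# Minimal sketch of the Cramér's V bias metric named in the abstract,
# computed from two categorical label arrays. Not the authors' pipeline;
# the example labels below are hypothetical.
import numpy as np

def cramers_v(labels_a, labels_b):
    """Cramér's V between two categorical label arrays (0 = no association, 1 = perfect)."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    # Contingency table of observed co-occurrence counts.
    table = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(table, (a_idx, b_idx), 1)
    n = table.sum()
    # Pearson chi-squared statistic against the independence expectation.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Hypothetical usage: annotated gender and predicted profession per image.
genders = np.array(["f", "m", "f", "m", "f", "m", "f", "m"])
preds = np.array(["nurse", "pilot", "nurse", "pilot",
                  "nurse", "nurse", "pilot", "pilot"])
print(f"Cramér's V: {cramers_v(genders, preds):.3f}")  # 0.500 for this toy data
```

In the audit described above, the same statistic would be computed once on the unmodified encoder's predictions and once after ablating the identified heads; the reported drop from 0.381 to 0.362 is the difference between those two evaluations.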