Although Vision-Language Models (VLMs) have achieved remarkable success, the knowledge mechanisms underlying their social biases remain a black box, even as the resulting fairness and ethics problems harm specific groups in society. The extent to which VLMs exhibit gender and race bias in generative responses is largely unknown. In this paper, we conduct a systematic investigation of gender and race bias in state-of-the-art VLMs, examining not only surface-level responses but also internal probability distributions and hidden-state dynamics. Our empirical analysis reveals three critical findings: 1) The Fairness Paradox: models often generate fair text labels while maintaining highly skewed confidence scores (miscalibration) toward specific social groups. 2) Layer-wise Fluctuation: fairness knowledge is not uniformly distributed across layers; it peaks in intermediate layers and undergoes substantial erosion in the final layers. 3) Residual Discrepancy: within a single hidden layer, different residual streams carry conflicting social knowledge: some directions reinforce fairness while others amplify bias. Leveraging these insights, we propose RES-FAIR (RESidual Flow Adjustment for Inference Recalibration), a post-hoc framework that mitigates bias by localizing biased residual directions, projecting hidden states away from them, and amplifying fair components. Evaluations on the PAIRS and SocialCounterfactuals datasets demonstrate that our discovery-based approach significantly improves response fairness and confidence calibration without compromising general reasoning ability. Our work provides a new lens for understanding how multi-modal models store and process sensitive social information.
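To make the projection-and-amplification step concrete, below is a minimal PyTorch sketch of the kind of residual-direction adjustment the abstract describes. It is an illustrative assumption, not the paper's actual implementation: the function name res_fair_adjust, the estimated direction vectors, and the coefficients alpha and beta are all hypothetical stand-ins.

```python
import torch

def res_fair_adjust(hidden: torch.Tensor,
                    bias_dir: torch.Tensor,
                    fair_dir: torch.Tensor,
                    alpha: float = 1.0,
                    beta: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch: project a hidden state away from a biased
    residual direction and amplify its component along a fair direction.

    hidden:   (d,) hidden state taken from an intermediate layer
    bias_dir: (d,) unit vector assumed to encode biased social knowledge
    fair_dir: (d,) unit vector assumed to encode fair social knowledge
    alpha, beta: hypothetical scaling coefficients
    """
    # Subtract the component of the hidden state along the biased direction.
    hidden = hidden - alpha * (hidden @ bias_dir) * bias_dir
    # Boost the component of the hidden state along the fair direction.
    hidden = hidden + beta * (hidden @ fair_dir) * fair_dir
    return hidden

# Toy usage with random stand-ins for the estimated directions.
d = 4096
h = torch.randn(d)
bias_dir = torch.nn.functional.normalize(torch.randn(d), dim=0)
fair_dir = torch.nn.functional.normalize(torch.randn(d), dim=0)
h_adjusted = res_fair_adjust(h, bias_dir, fair_dir)
```

In such a scheme, the direction vectors would presumably be estimated from the layer-wise and residual-stream analysis the abstract reports, rather than sampled at random as in this toy example.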