Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and bias, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and multi-run consistency, across both context types and across negative and non-negative question polarities. Our results show that modern LLMs generally perform well in ambiguous contexts, but the resulting sparsity of non-unknown predictions limits the informativeness of fairness metrics in that setting. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment: models are more accurate when the correct answer reinforces a stereotype than when it contradicts one. This pattern is especially pronounced at race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, even where observed disparity is low, outcome distributions remain uneven across intersectional groups. Responses, including stereotype-aligned ones, also vary in consistency across repeated runs. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and that no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.