Bias evaluation for language models has made substantial progress on bounded comparisons, such as overt derogation, stereotype association, or label-sensitive differences under controlled substitutions. Open-ended explanations raise a different problem: they guide interpretation by assigning responsibility, legitimacy, context, and grievance. A model can avoid hostile language while still presenting one side as structurally understandable and the other as personally at fault, overreacting, or less worth taking seriously. We call this stance-bearing asymmetry in generative explanations. We propose Symmetry Decomposition Evaluation (SDE), which tests paired situations with concrete group labels, structural-role rewrites, and explicit supporting or counter-evidence. In a controlled 32-family prototype suite, this decomposition shows that surface differences are not all alike: some weaken under structural or evidence control, while others persist as stable differences in how the model assigns blame, context, or legitimacy. Targeted case review and judge comparison point to a broader difficulty in evaluating open-ended framing asymmetries: judge readings shift across operationalizations, and scalar scores can flatten distinctions that readers use to interpret explanatory stance. SDE therefore reframes generative bias evaluation as an audit of explanatory stance: what stance each side receives, how it changes under decomposition, and where automatic scoring becomes unstable.