Accurately evaluating how language models treat specific groups differently is critical to ensuring a positive and safe user experience. An ideal evaluation should be robust, extendable to new groups or attributes, and able to capture biases that appear in typical usage (rather than only in extreme, rare cases). Relatedly, bias evaluation should surface not only egregious biases but also subtle, commonplace ones, such as a tendency to discuss appearance when talking about women. We present FairPair, an evaluation framework for assessing differential treatment that occurs during ordinary usage. FairPair operates on counterfactual pairs, but crucially, the paired continuations are grounded in the same demographic group, which ensures an equivalent comparison. Additionally, unlike prior work, our method accounts for the inherent variability of the generation process itself by measuring sampling variability. We present an evaluation of several commonly used generative models, along with a qualitative analysis that indicates a preference for discussing family and hobbies when talking about women.
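To make the idea of weighing a group difference against sampling variability concrete, the following is a minimal sketch and not the FairPair metric itself. It assumes each sampled continuation has already been reduced to a single numeric score by some hypothetical scoring function (for example, a sentiment score), and that several continuations were sampled per prompt. The function names and the "excess" quantity are illustrative assumptions, not definitions from the paper.

```python
from itertools import combinations
from statistics import mean


def between_gap(scores_a, scores_b):
    """Mean absolute score difference over all cross pairs (a, b)."""
    return mean(abs(a - b) for a in scores_a for b in scores_b)


def within_gap(scores):
    """Mean absolute difference between distinct samples of one prompt.

    Serves as a rough estimate of sampling variability (assumption).
    Requires at least two samples.
    """
    return mean(abs(x - y) for x, y in combinations(scores, 2))


def gap_vs_variability(scores_a, scores_b):
    """Return (between, within, excess) for a counterfactual prompt pair.

    'excess' is the between-prompt gap minus the average within-prompt gap;
    a value near zero suggests the observed difference is no larger than
    what resampling alone produces.
    """
    between = between_gap(scores_a, scores_b)
    within = mean([within_gap(scores_a), within_gap(scores_b)])
    return between, within, between - within


if __name__ == "__main__":
    # Hypothetical scores for continuations of two counterfactual prompts.
    print(gap_vs_variability([0.0, 1.0], [2.0, 3.0]))
```

Comparing the between-prompt gap to a within-prompt baseline is one simple way to avoid reporting a difference that is really just generation noise; other choices of baseline or test statistic would fit the same outline.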