The development of unbiased large language models is widely recognized as crucial, yet existing benchmarks fall short in detecting bias due to limited scope, contamination, and the lack of a fairness baseline. SAGED(-Bias) is the first holistic benchmarking pipeline to address these problems. The pipeline encompasses five core stages: scraping materials, assembling benchmarks, generating responses, extracting numeric features, and diagnosing with disparity metrics. SAGED includes metrics for max disparity, such as impact ratio, and for bias concentration, such as Max Z-scores. Noting that assessment-tool bias and contextual bias in prompts can distort evaluation, SAGED implements counterfactual branching and baseline calibration to mitigate them. For demonstration, we apply SAGED to the G20 countries with popular 8B-level models, including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we find that while Mistral and Qwen2 show lower max disparity and higher bias concentration than Gemma2 and Llama3.1, all models are notably biased against countries such as Russia and, except for Qwen2, China. In further experiments in which the models role-play U.S. (vice-/former-) presidents, we observe that bias amplifies and shifts in heterogeneous directions. Moreover, we find that Qwen2 and Mistral do not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more intensively than Biden and Harris, indicating role-playing performance bias in these models.
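To make the two disparity metrics named above concrete, here is a minimal sketch (not the SAGED implementation, whose API the abstract does not specify) of how an impact ratio and a Max Z-score could be computed from per-group mean sentiment scores; the country names and score values are hypothetical placeholders.

```python
# Minimal sketch of the two disparity metrics, assuming each group
# (e.g. a G20 country) has a mean sentiment score in [0, 1].
from statistics import mean, pstdev

def impact_ratio(group_scores: dict[str, float]) -> float:
    """Max-disparity metric: ratio of the lowest to the highest group score.
    Values near 1 mean groups are treated similarly; near 0, very unequally."""
    return min(group_scores.values()) / max(group_scores.values())

def max_z_score(group_scores: dict[str, float]) -> tuple[str, float]:
    """Bias-concentration metric: the group whose score deviates most
    (in standard deviations) from the mean over all groups."""
    values = list(group_scores.values())
    mu, sigma = mean(values), pstdev(values)
    group, score = max(group_scores.items(), key=lambda kv: abs(kv[1] - mu))
    return group, abs(score - mu) / sigma

# Hypothetical per-country sentiment scores for one model:
scores = {"USA": 0.72, "Russia": 0.41, "China": 0.48, "Brazil": 0.66}
print(impact_ratio(scores))   # 0.41 / 0.72 ≈ 0.57
print(max_z_score(scores))    # ('Russia', ≈1.24): bias concentrates there
```

Under this reading, a low impact ratio flags that some group fares much worse than the best-treated one, while a high Max Z-score flags that the disparity is concentrated in one outlier group rather than spread evenly.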