Question-answering (QA) and reading comprehension (RC) benchmarks are commonly used to assess the capability of large language models (LLMs) to retrieve and reproduce knowledge. However, we demonstrate that popular QA and RC benchmarks do not cover questions about different demographics or regions in a representative way. We perform a content analysis of 30 benchmark papers and a quantitative analysis of 20 respective benchmark datasets to determine (1) who is involved in the benchmark creation, (2) whether the benchmarks exhibit social bias, and whether this is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most benchmark papers analyzed provide insufficient information about those involved in benchmark creation, particularly the annotators. Notably, just one (WinoGrande) explicitly reports measures taken to address social representation issues. Moreover, the data analysis revealed gender, religion, and geographic biases across a wide range of encyclopedic, commonsense, and scholarly benchmarks. Our work adds to the mounting criticism of AI evaluation practices and highlights biased benchmarks as a potential source of LLM bias, since they incentivize biased inference heuristics.