Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations

Anka Reuel, Avijit Ghosh, Jenny Chim, Andrew Tran, Yanan Long, Jennifer Mickel, Usman Gohar, Srishti Yadav, Pawan Sasanka Ammanamanchi, Mowafak Allaham, Hossein A. Rahmani, Mubashara Akhtar, Felix Friedrich, Robert Scholz, Michael Alexander Riegler, Jan Batzner, Eliya Habba, Arushi Saxena, Anastassia Kornilova, Kevin Wei, Prajna Soni, Yohan Mathew, Kevin Klyman, Jeba Sania, Subramanyam Sahoo, Olivia Beyer Bruvik, Pouya Sadeghi, Sujata Goswami, Angelina Wang, Yacine Jernite, Zeerak Talat, Stella Biderman, Mykel Kochenderfer, Sanmi Koyejo, Irene Solaiman

From arXiv. Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026), Seoul, Korea.

Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their risks and capabilities. Although general capability evaluations are widespread, social impact assessments covering bias, fairness, privacy, environmental costs, and labor remain uneven. To characterize this landscape, we conduct the first comprehensive analysis of social impact evaluation reporting, examining 186 first-party release reports and 248 third-party evaluation sources, supplemented by developer interviews. We find a stark division of labor: first-party reporting is sparse, often superficial, and declining in areas like environmental impact and bias, while third-party evaluators provide broader, more rigorous coverage of bias, harmful content, and performance disparities. However, only developers can authoritatively report on data provenance, content moderation labor, costs, and infrastructure, yet interviews reveal these disclosures are deprioritized unless tied to product adoption or compliance. Current practices leave major gaps in assessing societal impacts, underscoring the need for policies that mandate developer transparency, strengthen independent evaluation ecosystems, and create shared infrastructure for aggregating third-party evaluations.
