Artificial intelligence systems increasingly mediate knowledge, communication, and decision making. Development and governance remain concentrated within a small set of firms and states, raising concerns that these technologies may encode narrow interests and limit public agency. Capability benchmarks for language, vision, and coding are common, yet public, auditable measures of pluralistic governance are rare. We define AI pluralism as the degree to which affected stakeholders can shape objectives, data practices, safeguards, and deployment. We present the AI Pluralism Index (AIPI), a transparent, evidence-based instrument that evaluates producers and system families across four pillars: participatory governance, inclusivity and diversity, transparency, and accountability. AIPI codes verifiable practices from public artifacts and independent evaluations, explicitly handling "Unknown" evidence to report both a lower-bound ("evidence") score and a known-only score, alongside evidence coverage. We formalize the measurement model; implement a reproducible pipeline that integrates structured web and repository analysis, external assessments, and expert interviews; and assess reliability with inter-rater agreement, coverage reporting, cross-index correlations, and sensitivity analysis. The protocol, codebook, scoring scripts, and evidence graph are maintained openly with versioned releases and a public adjudication process. We report pilot provider results and situate AIPI relative to adjacent transparency, safety, and governance frameworks. The index aims to steer incentives toward pluralistic practice and to equip policymakers, procurers, and the public with comparable evidence.
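The dual-score treatment of "Unknown" evidence can be sketched as follows. This is a hypothetical illustration, not the published AIPI codebook: indicators are coded 1 (practice verified), 0 (practice absent), or None (Unknown); the lower-bound "evidence" score treats Unknown as absent, the known-only score averages over resolved indicators, and coverage is the fraction of indicators with known evidence.

```python
def score(items):
    """Return (evidence_score, known_only_score, coverage).

    items: list of 1 (verified), 0 (absent), or None ("Unknown").
    evidence_score   -- lower bound: Unknown counts as 0.
    known_only_score -- mean over indicators with known evidence.
    coverage         -- fraction of indicators with known evidence.
    """
    n = len(items)
    known = [v for v in items if v is not None]
    evidence = sum(known) / n if n else 0.0          # Unknown -> 0
    known_only = sum(known) / len(known) if known else 0.0
    coverage = len(known) / n if n else 0.0
    return evidence, known_only, coverage

# Example: 3 verified practices, 1 absent, 2 Unknown across 6 indicators.
# The evidence score (0.5) is a lower bound on the known-only score (0.75);
# coverage (4/6) flags how much of the gap is due to missing evidence.
print(score([1, 1, 1, 0, None, None]))
```

Reporting both scores with coverage keeps the index honest about missing disclosures: a provider cannot raise its lower-bound score simply by withholding evidence, while the known-only score shows how verified practices compare at equal coverage.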