Global survey research increasingly informs high-stakes decisions in AI governance and cross-cultural policy, yet no standardized metric quantifies how well a sample's demographic composition matches its target population. Response rates and demographic quotas, the prevailing proxies for sample quality, measure effort and coverage but not distributional fidelity. This paper introduces the Global Representativeness Index (GRI), a framework grounded in Total Variation Distance that scores any survey sample against population benchmarks across multiple demographic dimensions on a [0, 1] scale. Validation on seven waves of the Global Dialogues survey (N = 7,500 across 60+ countries) finds fine-grained demographic GRI scores of only 0.33 to 0.36, roughly 43% of the theoretical maximum at that sample size. Cross-validation on the World Values Survey (seven waves, N = 403,000), Afrobarometer Round 9 (N = 53,000), and Latinobarometro (N = 19,000) reveals that even large probability surveys score below 0.22 on fine-grained global demographics when country coverage is limited. The GRI connects to classical survey statistics through the design effect. Reporting GRI together with the effective sample size (effective N) is recommended as a minimum summary of sample quality, since GRI quantifies demographic distance symmetrically while effective N captures the asymmetric inferential cost of underrepresentation. The framework is released as an open-source Python library with UN and Pew Research Center population benchmarks, applicable to survey research, machine learning dataset auditing, and AI evaluation benchmarks.
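To make the two summary quantities concrete, the following is a minimal sketch, not the released library's API. It assumes that a single-dimension GRI is one minus the Total Variation Distance between sample and population shares, and it uses Kish's approximation for effective N under post-stratification weights. The abstract does not state either formula explicitly, so both are assumptions, and the function names `gri_score` and `effective_n` and the example figures are hypothetical.

```python
from typing import Mapping


def proportions(counts: Mapping[str, float]) -> dict[str, float]:
    """Convert raw stratum counts to shares that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("sample must contain at least one respondent")
    return {k: v / total for k, v in counts.items()}


def gri_score(sample_counts: Mapping[str, float],
              population_shares: Mapping[str, float]) -> float:
    """Assumed form: GRI = 1 - TVD, where TVD = 0.5 * sum |s_i - p_i|."""
    sample = proportions(sample_counts)
    strata = set(sample) | set(population_shares)
    tvd = 0.5 * sum(
        abs(sample.get(k, 0.0) - population_shares.get(k, 0.0)) for k in strata
    )
    return 1.0 - tvd


def effective_n(sample_counts: Mapping[str, float],
                population_shares: Mapping[str, float]) -> float:
    """Kish approximation: n_eff = (sum w)^2 / sum w^2 with w_i = p_i / s_i.

    Strata with no respondents cannot be reweighted and are skipped,
    so this understates the cost of strata that are missing entirely.
    """
    sample = proportions(sample_counts)
    sum_w = 0.0
    sum_w2 = 0.0
    for stratum, n_k in sample_counts.items():
        if n_k == 0:
            continue
        w = population_shares.get(stratum, 0.0) / sample[stratum]
        sum_w += n_k * w
        sum_w2 += n_k * w * w
    return (sum_w ** 2) / sum_w2 if sum_w2 > 0 else 0.0


# Hypothetical example: a sample of 100 that over-represents young respondents.
sample_counts = {"18-29": 50, "30-49": 30, "50+": 20}
population_shares = {"18-29": 0.25, "30-49": 0.35, "50+": 0.40}

gri = gri_score(sample_counts, population_shares)
n_eff = effective_n(sample_counts, population_shares)
print(f"GRI = {gri:.2f}, effective N = {n_eff:.1f}")  # GRI = 0.75, effective N = 75.0
```

In this toy case the implied design effect is 100 / 75, about 1.33. The asymmetry described above shows up in the weights: the under-represented 50+ stratum carries a weight of 2.0, which inflates the variance of weighted estimates more than the over-represented stratum's weight of 0.5 reduces it, whereas the distance term treats both deviations alike.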