Global survey research increasingly informs high-stakes decisions in AI governance and cross-cultural policy, yet no standardized metric quantifies how well a sample's demographic composition matches its target population. Response rates and demographic quotas -- the prevailing proxies for sample quality -- measure effort and coverage but not distributional fidelity. This paper introduces the Global Representativeness Index (GRI), a framework grounded in Total Variation Distance that scores any survey sample against population benchmarks across multiple demographic dimensions on a [0, 1] scale. Validation on seven waves of the Global Dialogues survey (N = 7,500 across 60+ countries) finds fine-grained demographic GRI scores of only 0.33--0.36 -- roughly 43% of the theoretical maximum at that sample size. Cross-validation on the World Values Survey (seven waves, N = 403,000), Afrobarometer Round 9 (N = 53,000), and Latinobarometro (N = 19,000) reveals that even large probability surveys score below 0.22 on fine-grained global demographics when country coverage is limited. The GRI connects to classical survey statistics through the design effect and the effective sample size (effective N) derived from it. GRI and effective N are recommended together as a minimum summary of sample quality, since GRI quantifies demographic distance symmetrically while effective N captures the asymmetric inferential cost of underrepresentation. The framework is released as an open-source Python library with UN and Pew Research Center population benchmarks, applicable to survey research, machine learning dataset auditing, and AI evaluation benchmarks.
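As a minimal illustration of the two summary quantities named above, the sketch below computes a TVD-based score and an effective sample size. It rests on assumptions the abstract does not state: the score is taken to be one minus the Total Variation Distance between sample and population category shares, and effective N is taken to be Kish's approximation, (Σw)² / Σw². The function names are hypothetical and are not the released library's API.

```python
def total_variation_distance(sample_props, pop_props):
    """Half the L1 distance between two categorical distributions.

    Both arguments map category -> proportion; a missing category counts as 0.
    """
    categories = set(sample_props) | set(pop_props)
    return 0.5 * sum(
        abs(sample_props.get(c, 0.0) - pop_props.get(c, 0.0)) for c in categories
    )


def gri_sketch(sample_props, pop_props):
    """ASSUMPTION: score = 1 - TVD, so 1.0 means a perfect demographic match."""
    return 1.0 - total_variation_distance(sample_props, pop_props)


def kish_effective_n(weights):
    """Kish's approximate effective sample size for a set of survey weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Hypothetical three-category example: the sample over-represents A and misses C.
population = {"A": 0.5, "B": 0.3, "C": 0.2}
sample = {"A": 0.7, "B": 0.3}

score = gri_sketch(sample, population)
n_eff_equal = kish_effective_n([1.0, 1.0, 1.0, 1.0])  # equal weights: no loss
n_eff_unequal = kish_effective_n([1.0, 1.0, 2.0])     # unequal weights shrink it
```

In this toy case the score treats the surplus in A and the shortfall in C identically, while effective N falls only as the weights needed to correct the sample become more unequal, which is the symmetric versus asymmetric distinction the abstract draws.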