Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication does not merely increase resource consumption but systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.