How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI

Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication does not merely increase resource consumption but systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.

