How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI

Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation as reflected in language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication drives substantial and growing environmental costs while systematically redistributing labour risks and representational harms toward the Global South. To mitigate these costs, we propose the Data PROOFS recommendations, which span provenance, resource awareness, ownership, openness, frugality, and standards. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.
