Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication does not merely increase resource consumption but systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.