This position paper argues that the machine learning community must move from preaching to practising data frugality for responsible artificial intelligence (AI) development. For too long, progress has been equated with ever-larger datasets, an approach that drove remarkable advances but now yields diminishing performance gains alongside rising energy use and carbon emissions. While awareness of data-frugal approaches has grown, their adoption has remained rhetorical, and data scaling continues to dominate development practice. We argue that this gap between preaching and practice must be closed, as continued data scaling entails substantial and under-accounted environmental impacts. To ground our position, we provide indicative estimates of the energy use and carbon emissions associated with the downstream use of ImageNet-1K. We then present empirical evidence that data frugality is both practical and beneficial: subset selection methods can substantially reduce training energy consumption with little loss in accuracy, while also mitigating dataset bias. Finally, we outline actionable recommendations for moving data frugality from rhetoric to concrete practice in the responsible development of AI.