Exploring the Carbon Footprint of Hugging Face's ML Models: A Repository Mining Study

The rise of machine learning (ML) systems has increased their carbon footprint, driven by growing capabilities and model sizes. However, little is known about how the carbon footprint of ML models is actually measured, reported, and evaluated. This paper therefore analyzes the carbon footprint measurements of 1,417 ML models and associated datasets on Hugging Face, the most popular repository for pretrained ML models. The goal is to provide insights and recommendations on how to report and optimize the carbon efficiency of ML models. To this end, the paper presents the first repository mining study of carbon emissions conducted through the Hugging Face Hub API. The study answers two research questions: (1) how do ML model creators measure and report carbon emissions on the Hugging Face Hub, and (2) what aspects impact the carbon emissions of training ML models? Key findings include a decreasing proportion of models that report carbon emissions, a slight decrease in the reported carbon footprint on Hugging Face over the past two years, and the continued dominance of NLP as the main application domain. The study also uncovers correlations between carbon emissions and attributes such as model size, dataset size, and ML application domain. These results highlight the need for software measurements to improve energy reporting practices and to promote carbon-efficient model development within the Hugging Face community. In response, two classifications are proposed: one that categorizes models by their carbon emission reporting practices and one that categorizes them by their carbon efficiency. Both classifications aim to foster transparency and sustainable model development within the ML community.
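
The two quantities the abstract refers to, the share of models that report emissions and the correlation between emissions and a model attribute, can be illustrated with a minimal sketch. It assumes each model is represented as a dictionary shaped like Hugging Face model card metadata, where `co2_eq_emissions` is either a number (grams of CO2-eq) or a mapping with an `emissions` key. The records, the `size_bytes` field, and the helper names are hypothetical, and Spearman rank correlation is chosen here only for illustration; it is not necessarily the statistic used in the study.

```python
from math import sqrt
from statistics import mean


def reported_emissions(card):
    """Return reported emissions in grams of CO2-eq, or None if not reported."""
    value = card.get("co2_eq_emissions")
    if isinstance(value, dict):
        value = value.get("emissions")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return None


def reporting_ratio(cards):
    """Fraction of models whose metadata carries an emissions value."""
    reporting = [c for c in cards if reported_emissions(c) is not None]
    return len(reporting) / len(cards)


def ranks(values):
    """1-based ranks; tied values share the average rank of their group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical records; names, sizes, and emission values are invented.
cards = [
    {"model": "a", "size_bytes": 100, "co2_eq_emissions": 5.0},
    {"model": "b", "size_bytes": 400, "co2_eq_emissions": {"emissions": 20.0}},
    {"model": "c", "size_bytes": 250},  # no emissions reported
    {"model": "d", "size_bytes": 900, "co2_eq_emissions": {"emissions": 12.0}},
]

ratio = reporting_ratio(cards)
reporting = [c for c in cards if reported_emissions(c) is not None]
rho = spearman(
    [c["size_bytes"] for c in reporting],
    [reported_emissions(c) for c in reporting],
)
print(f"reporting ratio: {ratio:.2f}, Spearman rho (size vs emissions): {rho:.2f}")
```

In a real mining setup the records would come from the Hub rather than a hard-coded list, and a rank-based statistic is a reasonable default because model sizes and emissions tend to span several orders of magnitude.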
