Exploring the Carbon Footprint of Hugging Face's ML Models: A Repository Mining Study

The rise of machine learning (ML) systems, with their growing capabilities and model sizes, has increased their carbon footprint. However, little is known about how the carbon footprint of ML models is actually measured, reported, and evaluated. This paper analyzes the carbon footprint measurements of 1,417 ML models and their associated datasets on Hugging Face, the most popular repository for pretrained ML models, with the goal of providing insights and recommendations on how to report and optimize the carbon efficiency of ML models. It presents the first repository mining study of carbon emissions that uses the Hugging Face Hub API. The study addresses two research questions: (1) How do ML model creators measure and report carbon emissions on the Hugging Face Hub? (2) What aspects impact the carbon emissions of training ML models? Key findings include a stalled proportion of models that report carbon emissions, a slight decrease in the reported carbon footprint on Hugging Face over the past two years, and the continued dominance of NLP as the main application domain. The study also uncovers correlations between carbon emissions and attributes such as model size, dataset size, and ML application domain. These results highlight the need for software measurements to improve energy reporting practices and promote carbon-efficient model development within the Hugging Face community. In response, two classifications are proposed: one that categorizes models by their carbon emission reporting practices and another that categorizes them by their carbon efficiency. Both classifications aim to foster transparency and sustainable model development within the ML community.
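
To make the first research question concrete, the following minimal sketch shows one way reported emissions could be read from model card metadata. It assumes the metadata has already been fetched and parsed into plain dictionaries, and that emissions appear under the `co2_eq_emissions` key either as a bare number or as a mapping with an `emissions` entry (in grams of CO2-equivalent). The function names `reported_emissions_grams` and `reporting_share` and the sample cards are hypothetical illustrations, not the study's actual pipeline or data.

```python
from typing import Any, Optional


def reported_emissions_grams(card_metadata: dict[str, Any]) -> Optional[float]:
    """Return the reported emissions value, or None if the card reports nothing usable.

    Assumption: `co2_eq_emissions` is either a number or a mapping that
    holds the number under the `emissions` key.
    """
    value = card_metadata.get("co2_eq_emissions")
    if isinstance(value, dict):
        value = value.get("emissions")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)


def reporting_share(cards: list[dict[str, Any]]) -> float:
    """Fraction of cards that report a usable emissions value."""
    if not cards:
        return 0.0
    reporting = sum(1 for card in cards if reported_emissions_grams(card) is not None)
    return reporting / len(cards)


# Hypothetical metadata for three model cards (illustrative values only).
sample_cards = [
    {"co2_eq_emissions": 12.5},
    {"co2_eq_emissions": {"emissions": 3, "source": "code carbon"}},
    {"license": "mit"},
]
share = reporting_share(sample_cards)
```

Normalizing both shapes of the field into a single optional float keeps the later steps (counting reporting models, correlating emissions with other attributes) independent of how each model creator filled in the card.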