Hugging Face (HF) has established itself as a crucial platform for the development and sharing of machine learning (ML) models. This repository mining study analyzes more than 380,000 models, using data gathered via the HF Hub API, to explore community engagement, evolution, and maintenance around models hosted on HF, aspects that have yet to be comprehensively explored in the literature. We first examine the overall growth and popularity of HF, uncovering trends in ML domains, framework usage, and author groupings, as well as the evolution of the tags and datasets used. Through text analysis of model card descriptions, we also seek to identify prevalent themes and insights within the developer community. Our investigation further extends to model maintenance: we evaluate the maintenance status of ML models, classify commit messages into three categories (corrective, perfective, and adaptive), analyze how commit metrics evolve across development stages, and introduce a new classification system that estimates the maintenance status of a model from multiple attributes. The study aims to provide insights into ML model maintenance and evolution that could inform future model development strategies on platforms like HF.
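To make the commit-message categories concrete, the following is a minimal sketch of a keyword-based classifier for the corrective, perfective, and adaptive labels. It is an illustration only and not the classification method used in the study: the keyword sets, the priority order (corrective, then perfective, then adaptive), the fallback label `other`, and the function names `classify_commit` and `summarize` are all assumptions made for this example. Commit messages are passed in as plain strings; fetching them from the Hub is left out.

```python
import re
from collections import Counter

# Hypothetical keyword sets chosen for illustration; the study's actual
# classification scheme is not described in the abstract.
# Dict order defines priority: the first category with a matching word wins.
KEYWORDS = {
    "corrective": {"fix", "fixes", "fixed", "bug", "error", "typo", "revert"},
    "perfective": {"improve", "improved", "refactor", "cleanup", "readme", "docs"},
    "adaptive": {"add", "added", "support", "upgrade", "migrate"},
}


def classify_commit(message: str) -> str:
    """Return the first category whose keyword set shares a word with the message."""
    # Lowercase the message and split it into alphabetic words.
    words = set(re.findall(r"[a-z]+", message.lower()))
    for category, keywords in KEYWORDS.items():
        if words & keywords:
            return category
    # Assumed fallback label for messages that match no keyword.
    return "other"


def summarize(messages: list[str]) -> Counter:
    """Count how many messages fall into each category."""
    return Counter(classify_commit(m) for m in messages)


if __name__ == "__main__":
    sample = [
        "Fix typo in config.json",
        "Update README.md",
        "Add support for safetensors",
        "Upload model",
    ]
    for msg in sample:
        print(f"{classify_commit(msg):<11} {msg}")
    print(dict(summarize(sample)))
```

A keyword heuristic like this is easy to audit but brittle: a message such as "Add fix for tokenizer" is labeled corrective only because of the assumed priority order, so any real analysis would need a validated scheme.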