The proliferation of Machine Learning (ML) models and their open-source implementations has transformed Artificial Intelligence research and applications. Platforms like Hugging Face (HF) enable the development, sharing, and deployment of these models, fostering an evolving ecosystem. While previous studies have examined aspects of models hosted on platforms like HF, a comprehensive longitudinal study of how these models change over time is still lacking. This study addresses that gap by combining repository mining and longitudinal analysis to examine over 200,000 commits and 1,200 releases from more than 50,000 models on HF. We replicate and extend an ML change taxonomy for classifying commits and use Bayesian networks to uncover patterns in commit and release activities over time. Our findings indicate that commit activities align with established data science methodologies, such as CRISP-DM, emphasizing iterative refinement and continuous improvement. Release patterns, in turn, tend to consolidate significant updates, particularly in documentation, distinguishing granular changes from milestone-based releases. Furthermore, more popular projects prioritize infrastructure enhancements early in their lifecycle, and projects with intensive collaboration practices exhibit stronger documentation standards. These and other insights enhance the understanding of model changes on community platforms and provide valuable guidance for best practices in model maintenance.