The proliferation of Machine Learning (ML) models and their open-source implementations has transformed Artificial Intelligence research and applications. Platforms like Hugging Face (HF) enable the development, sharing, and deployment of these models, fostering an evolving ecosystem. While previous studies have examined aspects of models hosted on platforms like HF, a comprehensive longitudinal study of how these models change over time is still lacking. This study addresses that gap by combining repository mining and longitudinal analysis to examine over 200,000 commits and 1,200 releases from more than 50,000 models on HF. We replicate and extend an ML change taxonomy for classifying commits and use Bayesian networks to uncover patterns in commit and release activities over time. Our findings indicate that commit activities align with established data science methodologies, such as CRISP-DM, emphasizing iterative refinement and continuous improvement. Release patterns, in turn, tend to consolidate significant updates, particularly in documentation, distinguishing granular changes from milestone-based releases. Furthermore, more popular projects prioritize infrastructure enhancements early in their lifecycle, and projects with intensive collaboration practices exhibit stronger documentation standards. These and other insights enhance the understanding of model changes on community platforms and provide valuable guidance for best practices in model maintenance.