An Empirical Analysis of Machine Learning Model and Dataset Documentation, Supply Chain, and Licensing Challenges on Hugging Face

The last decade has seen widespread adoption of Machine Learning (ML) components in software systems. This has occurred in nearly every domain, from natural language processing to computer vision. These ML components range from relatively simple neural networks to complex and resource-intensive large language models. However, despite this widespread adoption, little is known about the supply chain relationships that produce these models, which can have implications for compliance and security. In this work, we conducted an extensive analysis of 760,460 models and 175,000 datasets extracted from the popular model-sharing site Hugging Face. First, we evaluate the current state of documentation in the Hugging Face supply chain, report real-world examples of shortcomings, and offer actionable suggestions for improvement. Next, we analyze the underlying structure of the existing supply chain. Finally, we explore the current licensing landscape against what was reported in previous work and discuss the unique challenges posed in this domain. Our results motivate multiple research avenues, including the need for better license management for ML models/datasets, better support for model documentation, and automated inconsistency checking and validation. We make our research infrastructure and dataset available to facilitate future research.

