The last decade has seen widespread adoption of Machine Learning (ML) components in software systems. This has occurred in nearly every domain, from natural language processing to computer vision. These ML components range from relatively simple neural networks to complex and resource-intensive large language models. However, despite this widespread adoption, little is known about the supply chain relationships that produce these models, which can have implications for compliance and security. In this work, we conduct an extensive analysis of 760,460 models and 175,000 datasets extracted from the popular model-sharing site Hugging Face. First, we evaluate the current state of documentation in the Hugging Face supply chain, report real-world examples of shortcomings, and offer actionable suggestions for improvement. Next, we analyze the underlying structure of the existing supply chain. Finally, we compare the current licensing landscape with what was reported in previous work and discuss the unique challenges posed in this domain. Our results motivate multiple research avenues, including the need for better license management for ML models and datasets, better support for model documentation, and automated inconsistency checking and validation. We make our research infrastructure and dataset available to facilitate future research.