Developing and training deep learning models has become increasingly costly and complex. Consequently, software engineers are adopting pre-trained models (PTMs) for their downstream applications. The dynamics of the PTM supply chain remain largely unexplored, signaling a clear need for structured datasets that document not only the metadata of these models but also their downstream applications. Without such data, the mining software repositories (MSR) community cannot comprehensively understand the impact of PTM adoption and reuse. This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs), along with 28,575 open-source software repositories from GitHub that use these models. Additionally, the dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use. To enhance the dataset's comprehensiveness, we developed prompts for a large language model to automatically extract model metadata, including each model's training datasets, parameters, and evaluation metrics. Our analysis of this dataset provides the first summary statistics for the PTM supply chain, showing trends in PTM development and common shortcomings of PTM package documentation. Our example application reveals inconsistencies in software licenses between PTMs and their dependent projects. PeaTMOSS lays the foundation for future research, offering rich opportunities to investigate the PTM supply chain. We outline mining opportunities on PTMs, their downstream usage, and cross-cutting questions.