Developing and training deep learning models is expensive, so software engineers have begun to reuse pre-trained deep learning models (PTMs) and fine-tune them for downstream tasks. Despite the widespread use of PTMs, we know little about the corresponding software engineering behaviors and challenges. To enable the study of software engineering with PTMs, we present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software. PeaTMOSS has three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them. We challenge PeaTMOSS miners to discover software engineering practices around PTMs. A demo and link to the full dataset are available at: https://github.com/PurdueDualityLab/PeaTMOSS-Demos.