Open-Source Pre-Trained Models (PTMs) provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. To address this gap, we derive a taxonomy encompassing 147 SE tasks and apply an SE-oriented classification to PTMs in a popular open-source ML repository, Hugging Face (HF). Our repository mining study began with a systematically gathered database of PTMs from the HF API, considering their model card descriptions and metadata, and the abstracts of the associated arXiv papers. We confirmed SE relevance through multiple filtering steps: detecting outliers, identifying near-identical PTMs, and applying Gemini 2.0 Flash, validated through five pilot studies involving three human annotators. This approach uncovered 2,205 SE PTMs. We find that code generation is the most common SE task among PTMs, primarily focusing on software implementation, while requirements engineering and software design activities receive limited attention. In terms of ML tasks, text generation dominates within SE PTMs. Notably, the number of SE PTMs has increased markedly since 2023 Q2. Our classification provides a solid foundation for future automated SE scenarios, such as the sampling and selection of suitable PTMs.
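As a rough illustration of the mining step described above, the sketch below shows how PTM metadata and model card text might be gathered from the Hugging Face Hub API using the `huggingface_hub` library. This is a minimal sketch under assumptions, not the authors' actual pipeline; the `limit` value and the selection of fields are illustrative.

```python
# Minimal sketch (assumption: not the paper's actual pipeline) of collecting
# PTM metadata and model card text from the Hugging Face Hub API.
from huggingface_hub import HfApi, ModelCard

api = HfApi()

# List models with their card data; `limit` is kept small for illustration.
models = api.list_models(cardData=True, full=True, limit=100)

records = []
for m in models:
    # The model card description is loaded per repository; some repos have none.
    try:
        card_text = ModelCard.load(m.id).text
    except Exception:
        card_text = ""
    records.append({
        "model_id": m.id,
        "pipeline_tag": m.pipeline_tag,  # ML task tag assigned on HF
        "tags": m.tags,
        "card": card_text,
    })
```

From records like these, one could then apply outlier detection, near-duplicate filtering, and LLM-based SE relevance classification as described in the abstract.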