Background: Open-source Pre-Trained Models (PTMs) and datasets provide extensive resources for a wide range of Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. Aims: We apply an SE-oriented classification to PTMs and datasets hosted on Hugging Face (HF), a popular open-source ML repository, and analyze how PTMs have evolved over time. Method: We conducted a repository mining study. Starting from a database of PTMs and datasets gathered systematically through the HF API, we refined the selection by analyzing model and dataset cards and metadata such as tags, and confirmed SE relevance using Gemini 1.5 Pro. All analyses are replicable, and a replication package is publicly available. Results: Code generation is the most common SE task among PTMs and datasets; the primary focus is on software development, with limited attention to software management. The most popular PTMs and datasets likewise target mainly software development. Among ML tasks, text generation is the most common in SE PTMs and datasets. The number of PTMs for SE has increased markedly since 2023 Q2. Conclusions: This study underscores the need for broader task coverage to enhance the integration of ML within SE practices.
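The tag-based refinement and the per-quarter trend analysis described in the Method and Results can be illustrated with a minimal sketch. It is not the study's pipeline: the keyword set `SE_KEYWORDS`, the record fields (`id`, `tags`, `created_at`), and the sample records are hypothetical, and the Gemini 1.5 Pro confirmation step is omitted. Records of this shape are assumed to have been collected beforehand, for instance through the HF API.

```python
from collections import Counter
from datetime import datetime

# Hypothetical keyword list for a first-pass SE filter; the study's actual
# selection criteria are not reproduced here.
SE_KEYWORDS = {"code", "code-generation", "program-synthesis"}


def is_se_candidate(record):
    """Return True if any tag of the record matches an SE keyword (case-insensitive)."""
    tags = {tag.lower() for tag in record.get("tags", [])}
    return bool(tags & SE_KEYWORDS)


def quarter_of(iso_date):
    """Map an ISO date string such as '2023-05-10' to a label such as '2023 Q2'."""
    d = datetime.fromisoformat(iso_date)
    return f"{d.year} Q{(d.month - 1) // 3 + 1}"


def count_by_quarter(records):
    """Count SE candidate records per creation quarter."""
    return Counter(
        quarter_of(r["created_at"]) for r in records if is_se_candidate(r)
    )


# Illustrative records only; these are not real models or real results.
SAMPLE = [
    {"id": "example/model-a", "tags": ["code", "text-generation"], "created_at": "2023-05-10"},
    {"id": "example/model-b", "tags": ["image-classification"], "created_at": "2023-06-01"},
    {"id": "example/model-c", "tags": ["Code-Generation"], "created_at": "2023-04-02"},
    {"id": "example/model-d", "tags": ["program-synthesis"], "created_at": "2022-11-20"},
]

if __name__ == "__main__":
    for quarter, n in sorted(count_by_quarter(SAMPLE).items()):
        print(quarter, n)
```

A keyword match on tags only yields candidates. In the study, SE relevance is additionally checked against model and dataset cards and confirmed with an LLM, so a filter like this one would be at most a first pass.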