How do Pre-Trained Models Support Software Engineering? An Empirical Study in Hugging Face

Open-source Pre-Trained Models (PTMs) provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. To address this gap, we derive a taxonomy encompassing 147 SE tasks and apply an SE-oriented classification to PTMs in a popular open-source ML repository, Hugging Face (HF). Our repository mining study began with a systematically gathered database of PTMs from the HF API, considering their model card descriptions and metadata, and the abstracts of the associated arXiv papers. We confirmed SE relevance through multiple filtering steps: detecting outliers, identifying near-identical PTMs, and applying Gemini 2.0 Flash, which was validated with five pilot studies involving three human annotators. This approach uncovered 2,205 SE PTMs. We find that code generation is the most common SE task among PTMs, primarily focusing on software implementation, while requirements engineering and software design activities receive limited attention. In terms of ML tasks, text generation dominates within SE PTMs. Notably, the number of SE PTMs has increased markedly since 2023 Q2. Our classification provides a solid foundation for future automated SE scenarios, such as the sampling and selection of suitable PTMs.
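The abstract does not say how near-identical PTMs were identified. As an illustration only, the sketch below flags pairs of model cards whose descriptions overlap heavily, using Jaccard similarity over word 3-grams. The technique, the threshold of 0.8, the function names, and the model identifiers are all assumptions made for this example, not the authors' method.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of word k-grams in a lowercased, punctuation-free text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if len(tokens) < k:
        # Short texts become a single shingle (or none if the text is empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_identical_pairs(cards, threshold=0.8):
    """Return sorted ID pairs whose card texts reach the similarity threshold.

    `cards` maps a model ID to its model card description. The threshold
    is an assumed value and would need tuning on real data.
    """
    sets = {model_id: shingles(text) for model_id, text in cards.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]


# Hypothetical model IDs and descriptions, for illustration only.
cards = {
    "org/code-gen-a": "A model fine-tuned for Python code generation from natural language prompts",
    "org/code-gen-b": "A model fine-tuned for Python code generation from natural language prompts.",
    "org/sentiment": "Sentiment classifier for movie reviews trained on short English texts",
}
pairs = near_identical_pairs(cards)
```

A pairwise comparison like this is quadratic in the number of models, so a database of the size the study describes would likely need an approximate scheme such as MinHash; that choice is also outside what the abstract states.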

