Automated categorization of pre-trained models for software engineering: A case study with a Hugging Face dataset

Software engineering (SE) activities have been revolutionized by the advent of pre-trained models (PTMs), defined as large machine learning (ML) models that can be fine-tuned to perform specific SE tasks. However, users with limited expertise may need help selecting the appropriate model for their current task. To tackle this issue, the Hugging Face (HF) platform simplifies the use of PTMs by collecting, storing, and curating a large number of models. Nevertheless, the platform currently lacks a comprehensive categorization of PTMs designed specifically for SE: the existing tags are better suited to generic ML categories. This paper introduces an approach to address this gap by enabling the automatic classification of PTMs for SE tasks. First, we use a public dump of HF to extract PTM information, including model documentation and associated tags. Then, we employ a semi-automated method to identify SE tasks and their corresponding PTMs from the existing literature. The approach involves creating an initial mapping between HF tags and specific SE tasks, using a similarity-based strategy to identify PTMs with relevant tags. The evaluation shows that model cards are informative enough to classify PTMs by their pipeline tag. Moreover, we provide a mapping between SE tasks and stored PTMs by relying on model names.
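The tag-to-task mapping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the similarity measure, so the character-level ratio from Python's `difflib`, the 0.8 threshold, the task list, and the model records below are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical SE task list; the paper derives its own from the literature.
SE_TASKS = ["code generation", "code summarization", "bug fixing", "code translation"]


def normalize(text: str) -> str:
    """Lowercase and treat '-' and '_' as spaces so tags and task names are comparable."""
    return text.lower().replace("-", " ").replace("_", " ").strip()


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an assumed stand-in for the paper's measure)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def map_tags_to_tasks(tags, tasks=SE_TASKS, threshold=0.8):
    """Map each tag to its most similar SE task, keeping only matches above the threshold."""
    mapping = {}
    for tag in tags:
        score, task = max((similarity(tag, t), t) for t in tasks)
        if score >= threshold:
            mapping[tag] = task
    return mapping


def models_for_task(models, task, tag_to_task):
    """Return the names of models that carry at least one tag mapped to the given task."""
    return [
        m["name"]
        for m in models
        if any(tag_to_task.get(tag) == task for tag in m["tags"])
    ]


if __name__ == "__main__":
    # Hypothetical model records standing in for entries of the HF dump.
    models = [
        {"name": "example-org/code-model", "tags": ["code-generation", "pytorch"]},
        {"name": "example-org/vision-model", "tags": ["image-classification"]},
    ]
    all_tags = {tag for m in models for tag in m["tags"]}
    tag_map = map_tags_to_tasks(all_tags)
    print(tag_map)
    print(models_for_task(models, "code generation", tag_map))
```

A string ratio is only one possible choice here. An embedding-based similarity over tags or model-card text would fit the same two-step structure (build the tag-to-task mapping, then select the models whose tags are covered by it).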

