Software engineering (SE) activities have been revolutionized by the advent of pre-trained models (PTMs), defined as large machine learning (ML) models that can be fine-tuned to perform specific SE tasks. However, users with limited expertise may need help selecting the appropriate model for their task. To address this issue, the Hugging Face (HF) platform simplifies the use of PTMs by collecting, storing, and curating several models. Nevertheless, the platform currently lacks a comprehensive categorization of PTMs designed specifically for SE, i.e., the existing tags are better suited to generic ML categories. This paper introduces an approach that addresses this gap by enabling the automatic classification of PTMs for SE tasks. First, we use a public dump of HF to extract PTM information, including model documentation and associated tags. Then, we employ a semi-automated method to identify SE tasks and their corresponding PTMs from the existing literature. The approach creates an initial mapping between HF tags and specific SE tasks, using a similarity-based strategy to identify PTMs with relevant tags. The evaluation shows that model cards are informative enough to classify PTMs with respect to their pipeline tag. Moreover, we provide a mapping between SE tasks and stored PTMs by relying on model names.
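To illustrate the similarity-based mapping between HF tags and SE tasks, the following is a minimal sketch and not the implementation used in this work. The task list, the `map_tags_to_tasks` helper, the 0.8 threshold, and the choice of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical SE task names; the actual list would come from the literature.
SE_TASKS = ["code generation", "code summarization", "bug fixing"]


def normalize(label: str) -> str:
    """Lowercase a label and replace hyphens and underscores with spaces."""
    return label.lower().replace("-", " ").replace("_", " ").strip()


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def map_tags_to_tasks(tags, tasks=SE_TASKS, threshold=0.8):
    """Map each tag to its most similar SE task.

    Tags whose best score falls below the (assumed) threshold are left out.
    """
    mapping = {}
    for tag in tags:
        best_task = max(tasks, key=lambda task: similarity(tag, task))
        if similarity(tag, best_task) >= threshold:
            mapping[tag] = best_task
    return mapping


if __name__ == "__main__":
    # Example tags written in the hyphenated style commonly seen on HF.
    print(map_tags_to_tasks(["code-generation", "image-classification"]))
```

In this sketch, a tag such as `code-generation` matches the task "code generation" exactly after normalization, while a generic ML tag such as `image-classification` stays unmapped because no SE task is similar enough. A string-level measure is only one option; an embedding-based similarity could be substituted without changing the surrounding logic.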