Integrating Artificial Intelligence into Software Engineering (SE) requires a curated collection of models suited to SE tasks. With millions of models hosted on Hugging Face (HF) and new ones continuously being created, identifying SE models is infeasible without a dedicated catalogue. To address this gap, we present SEMODS: an SE-focused dataset of 3,427 models extracted from HF, built by combining automated collection with rigorous validation through manual annotation and large language model assistance. Our dataset links models to SE tasks and activities from the software development lifecycle, offers a standardized representation of their evaluation results, and supports multiple applications such as data analysis, model discovery, benchmarking, and model adaptation.