SACS: A Code Smell Dataset using Semi-automatic Generation Approach

Code smells are a major challenge in software refactoring: they indicate latent design or implementation flaws that may degrade software maintainability and hinder evolution. Over the past decades, research on code smells has received extensive attention, and studies applying machine learning techniques have become especially popular in recent years. However, one of the biggest obstacles to applying machine learning techniques is the lack of high-quality code smell datasets. Manually constructing such datasets is extremely labor-intensive, as identifying code smells requires substantial development expertise and considerable time. Automatically generated datasets, while scalable, frequently suffer from unreliable labels and lower data quality. To overcome this challenge, we explore a semi-automatic approach to generating a code smell dataset with high-quality samples. Specifically, we first apply a set of automatic generation rules to produce candidate smelly samples. We then use multiple metrics to divide the samples into an automatically accepted group and a manually reviewed group, so that reviewers can concentrate their effort on ambiguous samples. Furthermore, we establish structured review guidelines and develop an annotation tool to support the manual validation process. Using the proposed semi-automatic generation approach, we create an open-source code smell dataset, SACS, covering three widely studied code smells: Long Method, Large Class, and Feature Envy. Each code smell category includes over 10,000 labeled samples. The dataset can serve as a large-scale, publicly available benchmark to facilitate future studies on code smell detection and automated refactoring.
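As an illustration only, the triage step described in the abstract (splitting candidates into an automatically accepted group and a manually reviewed group) might be sketched as follows. The abstract does not say which metrics, thresholds, or decision rule SACS uses, so the metric names (`loc`, `cyclomatic`, `params`), the cut-off values, and the "accept only when every metric agrees" rule are all assumptions made for this sketch.

```python
from dataclasses import dataclass

# Hypothetical thresholds for a Long Method candidate. These names and values
# are assumptions for illustration; they are not taken from SACS.
THRESHOLDS = {"loc": 50, "cyclomatic": 10, "params": 5}


@dataclass
class Candidate:
    name: str
    metrics: dict  # metric name -> measured value


def triage(candidate, thresholds=THRESHOLDS):
    """Return 'auto_accept' if every metric exceeds its threshold,
    otherwise 'manual_review' (assumed rule: disagreement means ambiguity)."""
    votes = [candidate.metrics[name] > limit for name, limit in thresholds.items()]
    return "auto_accept" if all(votes) else "manual_review"


def split(candidates):
    """Group candidates by triage outcome, preserving input order."""
    groups = {"auto_accept": [], "manual_review": []}
    for candidate in candidates:
        groups[triage(candidate)].append(candidate)
    return groups
```

Under these assumed thresholds, a candidate that exceeds all three metrics is accepted without review, while one that exceeds only some of them is routed to a human reviewer, which is where the review guidelines and annotation tool would come in.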
