Code comments play a crucial role in software development: they document functionality, clarify design choices, and assist with issue tracking. They capture developers' insights about the surrounding source code and serve as an essential resource for both human comprehension and automated analysis. Nevertheless, because comments are written in natural language, they pose challenges for machine-based code understanding. To address this, recent studies have applied natural language processing (NLP) and deep learning techniques to classify comments according to developers' intentions. However, existing datasets for this task are small and class-imbalanced: they rely on manual annotation and may not accurately represent the distribution of comments in real-world codebases. To overcome these limitations, we introduce new synthetic oversampling and augmentation techniques based on high-quality data generation to enhance the NLBSE'26 challenge datasets. Our Synthetic Quality Oversampling Technique and Augmentation Technique (Q-SYNTH) yield promising results, improving the base classifier by $2.56\%$.