This paper explores the use of text data augmentation techniques to improve conflict and duplicate detection in software engineering tasks, framed as sentence pair classification. The study adapts generic augmentation techniques such as shuffling, back translation, and paraphrasing, and proposes new data augmentation techniques for software requirement texts, such as Noun-Verb Substitution, Target-Lemma Replacement, and Actor-Action Substitution. A comprehensive empirical analysis is conducted on six software text datasets to identify conflicts and duplicates among sentence pairs. The results demonstrate that data augmentation techniques have a significant impact on classification performance across all of the software text pair datasets. However, when a dataset is relatively balanced, augmentation may degrade classification performance.