Context: Classifying software requirements into different categories is a critically important task in requirements engineering (RE). Developing machine learning (ML) approaches for requirements classification has attracted great interest in the RE community since the 2000s. Objective: This paper aims to address two related problems that have hindered real-world application of ML approaches: class imbalance and high-dimensional, low-sample-size (HDLSS) data. Both problems can greatly degrade the classification performance of ML methods. Method: The paper proposes HC4RC, a novel ML approach for multiclass classification of requirements. HC4RC addresses these problems through semantic-role-based feature selection, dataset decomposition and hierarchical classification. We experimentally compare the effectiveness of HC4RC with three closely related approaches, two based on a traditional statistical classification model and one on an advanced deep learning model. Results: Our experiment shows that (1) the class imbalance and HDLSS problems present a challenge to both traditional and advanced ML approaches, and (2) the HC4RC approach is simple to use and addresses the class imbalance and HDLSS problems more effectively than similar approaches. Conclusion: This paper makes an important practical contribution to addressing the class imbalance and HDLSS problems in multiclass classification of software requirements.
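The abstract names dataset decomposition and hierarchical classification without giving their details, so the following is only a minimal sketch of the general idea, not HC4RC itself. It assumes a hypothetical two-level hierarchy (functional versus non-functional, then the specific non-functional class), a toy bag-of-words centroid classifier as the base model, and made-up example requirements. Semantic-role-based feature selection is omitted entirely. All class names, groupings and data here are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for real feature extraction)."""
    return text.lower().split()


class CentroidClassifier:
    """Toy base classifier: one averaged term-frequency vector per label."""

    def fit(self, texts, labels):
        sums = defaultdict(Counter)
        counts = Counter(labels)
        for text, label in zip(texts, labels):
            sums[label].update(tokenize(text))
        self.centroids = {
            label: {tok: n / counts[label] for tok, n in vec.items()}
            for label, vec in sums.items()
        }
        return self

    def predict(self, text):
        vec = Counter(tokenize(text))
        vec_norm = math.sqrt(sum(v * v for v in vec.values()))

        def cosine(centroid):
            dot = sum(vec[tok] * w for tok, w in centroid.items())
            norm = vec_norm * math.sqrt(sum(w * w for w in centroid.values()))
            return dot / norm if norm else 0.0

        return max(self.centroids, key=lambda lab: cosine(self.centroids[lab]))


class HierarchicalClassifier:
    """Two-level classifier: pick a coarse group first, then a class within it."""

    def __init__(self, groups):
        # groups maps a coarse group name to its fine-grained classes
        # (a hypothetical hierarchy chosen for this example).
        self.parent = {fine: coarse for coarse, fines in groups.items() for fine in fines}

    def fit(self, texts, labels):
        coarse = [self.parent[label] for label in labels]
        self.top = CentroidClassifier().fit(texts, coarse)
        # Decompose the dataset: each group gets its own smaller training subset.
        self.leaf = {}
        for group in set(coarse):
            subset = [(t, l) for t, l in zip(texts, labels) if self.parent[l] == group]
            sub_texts, sub_labels = zip(*subset)
            self.leaf[group] = CentroidClassifier().fit(sub_texts, sub_labels)
        return self

    def predict(self, text):
        group = self.top.predict(text)
        return self.leaf[group].predict(text)


# Made-up, deliberately imbalanced toy data: four functional requirements
# and one example for each non-functional class.
texts = [
    "the system shall allow users to create an account",
    "the system shall export reports as pdf",
    "the system shall let users search products",
    "the system shall send order confirmation emails",
    "passwords must be encrypted in storage",
    "pages must load within two seconds",
    "the interface must be easy to learn for new users",
]
labels = ["functional"] * 4 + ["security", "performance", "usability"]
groups = {"F": ["functional"], "NFR": ["security", "performance", "usability"]}

model = HierarchicalClassifier(groups).fit(texts, labels)
print(model.predict("all passwords must be encrypted"))
print(model.predict("the system shall allow users to export reports"))
```

In this sketch the minority classes are pooled into one group at the top level, so the first decision is less skewed than the flat seven-example problem, and each second-level model only has to separate classes within its own subset. Whether HC4RC builds its hierarchy this way is not stated in the abstract.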