App reviews reflect a variety of user requirements that can aid in planning maintenance tasks. Recently proposed approaches for automatically classifying user reviews rely on machine learning algorithms. Devine et al. demonstrated that models trained on existing labeled datasets perform poorly when predicting new ones. Although integrating datasets improves the results to some extent, generalizability remains limited, so a more comprehensive labeled dataset is essential for training a more precise model. This paper introduces an approach for training a more generalizable model by augmenting labeled datasets with information from an additional source, such as the GitHub issue tracking system, that contains valuable information about user requirements. First, we identify issues that correspond to review intentions (bug reports, feature requests, and others) by examining the issue labels. Then, we analyze issue bodies and define 19 language patterns for extracting targeted information. Finally, we augment the manually labeled review dataset with a subset of the processed issues through the Within-App, Within-Context, and Between-App Analysis methods. The first two methods train app-specific models, and the last suits general-purpose models. We conducted several experiments to evaluate the proposed approach. Our results demonstrate that using labeled issues for data augmentation can improve the F1-score and recall by up to 13.9 and 29.9, respectively, for bug reports, and by up to 7.5 and 13.5 for feature requests. Furthermore, we identify an effective augmentation volume range of 0.3 to 0.7, within which the performance improvements are larger.
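The label-based identification and the augmentation step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the label-to-intent mapping, the function names, and the reading of "volume" as the number of added issues relative to the number of reviews are all assumptions made for illustration, and the 19 language patterns are omitted because they are not listed here.

```python
import random

# Hypothetical label-to-intent mapping; the label lists actually used are not given here.
LABEL_TO_INTENT = {
    "bug": "bug_report",
    "defect": "bug_report",
    "enhancement": "feature_request",
    "feature": "feature_request",
}


def intent_from_labels(labels: list[str]) -> str:
    """Map GitHub issue labels to a review intention.

    Issues with no matching label, or with conflicting labels, fall back
    to "other" in this sketch.
    """
    intents = {
        LABEL_TO_INTENT[label.lower()]
        for label in labels
        if label.lower() in LABEL_TO_INTENT
    }
    return intents.pop() if len(intents) == 1 else "other"


def augment(reviews: list[tuple[str, str]],
            issues: list[tuple[str, str]],
            volume: float,
            seed: int = 0) -> list[tuple[str, str]]:
    """Append a random subset of labeled issues to the labeled reviews.

    Assumption: `volume` is the number of added issues relative to the
    number of reviews (0.5 adds half as many issues as there are reviews).
    """
    if volume < 0:
        raise ValueError("volume must be non-negative")
    k = min(len(issues), int(volume * len(reviews)))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return reviews + rng.sample(issues, k)
```

For example, `augment(reviews, issues, volume=0.5)` on four labeled reviews adds at most two labeled issues, and the original reviews stay at the front of the returned list.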