Stop words, which are considered non-predictive, are often removed in natural language processing tasks. However, what counts as uninformative vocabulary is only vaguely defined, so most algorithms rely on general knowledge-based stop lists to remove stop words. Researchers continue to debate the usefulness of stop word removal, especially in domain-specific settings. In this work, we investigate the usefulness of stop word removal in a software engineering context. To do this, we replicate and experiment with three software engineering research tools from related work. Additionally, we construct a corpus of software engineering domain-related text from 10,000 Stack Overflow questions and identify 200 domain-specific stop words using traditional information-theoretic methods. Our results show that using domain-specific stop words significantly improved the performance of the research tools compared to using a general stop list, with 17 out of 19 evaluation measures showing better performance. Online appendix: https://zenodo.org/record/7865748
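To make the idea of information-theoretic stop word identification concrete, the following is a minimal sketch of one common measure: ranking terms by the entropy of their occurrence distribution across documents. The abstract does not name the specific methods used, so the entropy criterion, the tokenizer, and the function names `tokenize`, `term_entropy`, and `candidate_stop_words` are illustrative assumptions rather than the procedure applied in this work.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and extract simple tokens (an assumed, deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9_+#]+", text.lower())


def term_entropy(documents):
    """Map each term to the entropy (in bits) of its distribution over documents.

    A term spread evenly over many documents has high entropy and carries
    little information about any single document.
    """
    per_term = defaultdict(Counter)
    for doc_id, text in enumerate(documents):
        for token in tokenize(text):
            per_term[token][doc_id] += 1

    entropies = {}
    for term, counts in per_term.items():
        total = sum(counts.values())
        entropies[term] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropies


def candidate_stop_words(documents, k):
    """Return the k highest-entropy terms (ties broken alphabetically)."""
    entropies = term_entropy(documents)
    ranked = sorted(entropies.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy stand-ins for Stack Overflow questions (hypothetical data).
    docs = [
        "how to fix error in java code",
        "python code error when parsing json",
        "code review error for rust borrow checker",
    ]
    print(candidate_stop_words(docs, 2))
```

In this toy corpus, terms that occur evenly in every document rank highest, while terms confined to a single document have zero entropy. On a real corpus, a cutoff such as the top 200 terms would then be inspected or validated before use as a stop list.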