Context and motivation: Large Language Models (LLMs) such as ChatGPT have recently demonstrated remarkable proficiency in a range of Natural Language Processing (NLP) tasks, and their application in Requirements Engineering (RE), especially requirements classification, has attracted increasing interest. Question/problem: We conducted an extensive empirical evaluation of ChatGPT models, including text-davinci-003, gpt-3.5-turbo, and gpt-4, in both zero-shot and few-shot settings for requirements classification, asking how these models compare with traditional classification methods, specifically Support Vector Machine (SVM) and Long Short-Term Memory (LSTM). Principal ideas/results: Across five diverse datasets, ChatGPT consistently outperforms LSTM; it is more effective than SVM at classifying functional requirements (FR), whereas SVM is better at classifying non-functional requirements (NFR). Contrary to our expectations, the few-shot setting does not always improve performance; in most instances it was suboptimal. Contribution: Our findings underscore the potential of LLMs in the RE domain and suggest that they could play a pivotal role in future software engineering processes, particularly as tools to enhance requirements classification.
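To make the distinction between the zero-shot and few-shot settings concrete, the following is a minimal sketch of how the two kinds of prompt might be assembled for FR/NFR classification. It is an illustration only, not the prompts used in the study: the function name `build_prompt`, the instruction wording, and the example requirements are all assumptions, and no model is actually called.

```python
from typing import Sequence, Tuple

# Label set taken from the abstract (FR vs. NFR); everything else is hypothetical.
LABELS = ("FR", "NFR")


def build_prompt(requirement: str,
                 examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a classification prompt.

    With no examples the prompt is zero-shot; with labeled
    (text, label) pairs it becomes few-shot.
    """
    lines = [
        "Classify the software requirement as one of: " + ", ".join(LABELS) + ".",
        "Answer with the label only.",
        "",
    ]
    # Few-shot demonstrations, if any, precede the query.
    for text, label in examples:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        lines.append(f"Requirement: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The requirement to classify comes last, with the label left open.
    lines.append(f"Requirement: {requirement}")
    lines.append("Label:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical requirements, invented for illustration.
    query = "The system shall export monthly reports as PDF."
    demos = [
        ("The user shall be able to reset a forgotten password.", "FR"),
        ("The page shall load within two seconds.", "NFR"),
    ]
    print(build_prompt(query))          # zero-shot
    print(build_prompt(query, demos))   # few-shot
```

The resulting string would then be sent to a model such as gpt-3.5-turbo or gpt-4; the only difference between the two settings is whether labeled demonstrations precede the query.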