Traditional language models have been extensively evaluated in the software engineering domain; however, the potential of ChatGPT and Gemini has not been fully explored. To fill this gap, this paper presents a comprehensive case study investigating the potential of both language models for developing diverse types of requirements engineering applications. It examines in depth the impact of prompts with varying levels of expert knowledge on the predictive accuracy of both language models. Across four public benchmark datasets for requirements engineering tasks, it compares the performance of both language models with existing task-specific machine/deep learning predictors and traditional language models. Specifically, the paper utilizes four benchmark datasets: PURE (7,445 samples, requirements extraction), PROMISE (622 samples, requirements classification), REQuestA (300 question-answer (QA) pairs), and Aerospace (6,347 words, requirements NER tagging). Our experiments reveal that, compared to ChatGPT, Gemini requires more careful prompt engineering to produce accurate predictions. On the requirements extraction benchmark, the state-of-the-art F1-score is 0.86, while ChatGPT and Gemini achieved 0.76 and 0.77, respectively. On the requirements classification dataset, the state-of-the-art F1-score is 0.96, whereas both language models achieved 0.78. On the named entity recognition (NER) task, the state-of-the-art F1-score is 0.92, while ChatGPT achieved 0.36 and Gemini 0.25. On the question-answering dataset, the state-of-the-art F1-score is 0.90, and ChatGPT and Gemini achieved 0.91 and 0.88, respectively. Except for question answering, both models underperform current state-of-the-art predictors across all tasks.