Context: Recently, many illustrative examples have shown ChatGPT's impressive ability to perform programming tasks and answer general-domain questions. Objective: We empirically evaluate how ChatGPT performs on requirements analysis tasks to derive insights into how generative large language models, represented by ChatGPT, influence the research and practice of natural language processing for requirements engineering. Method: We design an evaluation pipeline that includes two common requirements information retrieval tasks, four public datasets involving two typical requirements artifacts, querying ChatGPT with fixed task prompts, and quantitative and qualitative analysis of the results. Results: Quantitative results show that ChatGPT achieves comparable or better $F_\beta$ values on all datasets under a zero-shot setting. Qualitative analysis further illustrates ChatGPT's powerful natural language processing ability and its limited requirements engineering domain knowledge. Conclusion: The evaluation results demonstrate ChatGPT's impressive ability to retrieve requirements information from different types of artifacts involving multiple languages under a zero-shot setting. It is worthwhile for the research and industry communities to study requirements retrieval models based on generative large language models and to develop corresponding tools.
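The abstract reports results as $F_\beta$ values without restating the definition. As a reminder of what that metric measures, the following is a minimal sketch of the standard $F_\beta$ computation from retrieval counts. The function name `f_beta`, the example counts, and the choice of `beta=2.0` are illustrative assumptions and are not taken from the paper; the abstract does not state which value of $\beta$ was used.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Standard F-beta score from true-positive, false-positive and
    false-negative counts. beta > 1 weights recall more than precision."""
    if tp == 0:
        # No correct retrievals: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical counts for one retrieval run (not data from the paper).
    score = f_beta(tp=8, fp=4, fn=2, beta=2.0)
    print(f"F_beta = {score:.3f}")
```

With `beta=1.0` this reduces to the familiar F1 score, and larger values of `beta` reward retrieving more of the relevant requirements at the cost of some precision.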