Software specifications are essential for ensuring the reliability of software systems. Existing specification extraction approaches, however, generalize poorly and require manual effort. We study the effectiveness of Large Language Models (LLMs) in generating software specifications from software documentation, using Few-Shot Learning (FSL) to enable LLMs to generalize from a small number of examples. We compare the performance of LLMs with FSL to that of state-of-the-art specification extraction techniques and study the impact of prompt construction strategies on LLM performance. In addition, we conduct a comprehensive analysis of the symptoms and root causes of failures to understand the pros and cons of LLMs and existing approaches. We also compare the performance, cost, and response time of 11 LLMs for generating software specifications. We find that (1) the best-performing LLM outperforms existing approaches by 9.1–13.7% when given a few similar examples, (2) the two dominant root causes combined (ineffective prompts and missing domain knowledge) account for 57–60% of LLM failures, and (3) most of the 11 LLMs perform as well as or better than traditional techniques. Our study offers valuable insights for future research to improve specification generation.
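To illustrate what building a few-shot prompt from "a few similar examples" could look like, here is a minimal sketch. It is not the study's implementation: the token-overlap (Jaccard) similarity, the instruction wording, the specification notation, and the names `select_examples`, `build_prompt`, and `POOL` are all assumptions made for illustration.

```python
import re

def tokens(text):
    """Lowercased word tokens of a sentence, as a set."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def jaccard(a, b):
    """Token-overlap similarity in [0, 1] (an assumed similarity measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def select_examples(target, pool, k=2):
    """Return the k (documentation, specification) pairs most similar to target."""
    return sorted(pool, key=lambda ex: jaccard(target, ex[0]), reverse=True)[:k]

def build_prompt(target, pool, k=2):
    """Assemble a few-shot prompt: instruction, k solved examples, then the query."""
    parts = ["Extract the specification from each documentation sentence."]
    for doc, spec in select_examples(target, pool, k):
        parts.append(f"Documentation: {doc}\nSpecification: {spec}")
    parts.append(f"Documentation: {target}\nSpecification:")
    return "\n\n".join(parts)

# Hypothetical labeled examples; the specification format is invented here.
POOL = [
    ("The index must not be negative.", "index >= 0"),
    ("Returns null if the key is absent.", "result == null when key is absent"),
    ("The buffer must not be null.", "buffer != null"),
]

if __name__ == "__main__":
    print(build_prompt("The name must not be null.", POOL, k=2))
```

The resulting string would then be sent to an LLM of choice. Swapping `jaccard` for another similarity measure, or replacing `select_examples` with random selection, is one way to compare prompt construction strategies under this sketch.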