Validation is a central activity when developing formal specifications. Similarly to coding, a possible validation technique is to define upfront test cases or scenarios that a future specification should (or should not) satisfy. Unfortunately, specifying such test cases is burdensome and error-prone, which could cause users to skip this validation task. This paper reports the results of an empirical evaluation of using pre-trained large language models (LLMs) to automate the generation of test cases from natural language requirements. In particular, we focus on test cases for structural requirements of simple domain models formalized in the Alloy specification language. Our evaluation focuses on the state-of-the-art GPT-5 model, but results from other closed- and open-source LLMs are also reported. The results show that, in this context, GPT-5 is already quite effective at generating positive (and negative) test cases that are syntactically correct and that satisfy (or violate) the given requirement, and that can detect many wrong specifications written by humans.