AI-Driven Test Case Generation from Natural Language Requirements: A Survey of Techniques and Research Gaps

Software testing is critical for verifying that systems meet specified requirements, yet it remains among the most time-consuming and expensive activities in development. Requirements-based test generation allows test cases to be derived early from requirements artifacts, but generating them directly from natural language is challenging because natural language is inherently ambiguous and imprecise. Recent advances in AI, natural language processing (NLP), and large language models (LLMs) have made automating this pipeline increasingly feasible, while introducing new risks, including hallucination, reduced traceability, and inconsistent evaluation. This survey addresses four research questions: what AI and NLP techniques have been proposed for generating test cases from natural language requirements; what tools and frameworks support these approaches; how generated test cases are evaluated; and what research gaps remain. Following Kitchenham and Charters' systematic review guidelines, we searched major scholarly databases covering 2000 to 2025 and, after applying strict inclusion criteria, identified 21 primary studies. We organize the literature into three evolutionary eras and find that no existing approach simultaneously satisfies six key quality dimensions: automation, ambiguity handling, domain applicability, traceability, evaluation thoroughness, and hallucination control. The survey makes three main contributions: a three-era evolutionary synthesis of AI-based test generation; a six-criteria gap analysis showing that no current approach fully addresses all quality dimensions; and four actionable research guidelines targeting hallucination, traceability, complexity sensitivity, and compliance.
