Sakura: An Approach for Generating Complex Tests from Natural Language Test Descriptions

Testing is a core activity in software development workflows, and research on its automation has spanned several decades. Most existing approaches generate unit tests for individual methods, validate isolated API endpoints, or target user interface (UI) layers; generators that work outside the API and UI settings typically exercise only a single focal method. Recent empirical evidence shows a substantial gap between such generated tests and developer-written ones, which often span multiple focal methods, involve complex call sequences, and contain elaborate assertions that current automated approaches fail to capture. To address this gap, we propose generating tests from natural language (NL) descriptions of developer intent. We present Sakura, the first agent-based framework for generating structurally complex test cases from NL descriptions. Sakura decomposes NL descriptions into structured blocks and processes them with a multi-agent system consisting of a localization agent that grounds test steps in concrete application code via static analysis, a composition agent that synthesizes compilable test code and iteratively refines it using execution feedback, and a supervisor agent that coordinates agent interactions. To evaluate Sakura, we curate a novel dataset of NL test descriptions at three levels of abstraction, systematically generated from developer-written tests mined from Apache Commons projects. Across 20 applications and 1,464 test scenarios, Sakura significantly outperforms off-the-shelf agentic tools such as Gemini CLI. Specifically, Sakura achieves 50-78% higher test compilability and 38-66% higher coverage overlap with ground-truth tests than baselines using the same models. Moreover, Sakura paired with small open-source models such as Devstral Small 2 and Qwen3-Coder outperforms Gemini CLI using large proprietary models, while also being more cost-effective.
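The pipeline described above (decompose the description, ground each step in application code, then compose and refine the test under a supervisor) can be illustrated with a minimal sketch. This is not Sakura's implementation: every name here (`Step`, `decompose`, `localize`, `supervise`, `toy_compose`, `toy_execute`) is hypothetical, the keyword lookup merely stands in for static analysis, and the toy composer and executor stand in for an LLM call and a real compile-and-run step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One structured block obtained from the NL description (hypothetical shape)."""
    text: str
    grounded_symbol: Optional[str] = None  # filled in by the localization stand-in


def decompose(description: str) -> List[Step]:
    """Assumption: one step per non-empty line of the description."""
    return [Step(line.strip()) for line in description.splitlines() if line.strip()]


def localize(steps: List[Step], symbol_index: Dict[str, str]) -> None:
    """Localization stand-in: ground each step by keyword lookup.

    The abstract says grounding uses static analysis; a plain dictionary
    lookup is used here only to keep the sketch self-contained.
    """
    for step in steps:
        for keyword, symbol in symbol_index.items():
            if keyword in step.text.lower():
                step.grounded_symbol = symbol
                break


def supervise(
    description: str,
    symbol_index: Dict[str, str],
    compose: Callable[[List[Step], Optional[str]], str],
    execute: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Supervisor stand-in: run localization once, then loop compose/execute.

    Returns the first test code that executes successfully together with the
    number of rounds used, or (None, max_rounds) if every round fails.
    """
    steps = decompose(description)
    localize(steps, symbol_index)
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        code = compose(steps, feedback)   # composition agent stand-in
        ok, feedback = execute(code)      # execution feedback drives refinement
        if ok:
            return code, round_no
    return None, max_rounds


def toy_compose(steps: List[Step], feedback: Optional[str]) -> str:
    """Toy composer: emits one call per grounded step; adds an import only after feedback."""
    calls = [f"    {s.grounded_symbol}();" for s in steps if s.grounded_symbol]
    header = "import org.example.Stack;\n" if feedback else ""
    return header + "void test() {\n" + "\n".join(calls) + "\n}"


def toy_execute(code: str) -> Tuple[bool, str]:
    """Toy executor: 'compilation' fails until the code contains an import line."""
    if "import " not in code:
        return False, "error: cannot find symbol Stack"
    return True, ""
```

Calling `supervise("Push an element\nPop it back", {"push": "stack.push", "pop": "stack.pop"}, toy_compose, toy_execute)` fails in the first round because of the missing import and succeeds in the second once the feedback is passed back to the composer, which mirrors the iterative refinement loop in miniature.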
