TDD Without Tears: Towards Test Case Generation from Requirements through Deep Reinforcement Learning

Test-driven development (TDD) is a widely employed software development practice that mandates writing test cases based on requirements before writing the actual code. Although writing test cases is the centerpiece of TDD, it is time-consuming, expensive, and often shunned by developers. To address these issues, automated test case generation approaches have recently been investigated. Such approaches take source code as input, not the requirements. Existing work therefore does not fully support true TDD, because the actual code is required to generate test cases. In addition, current deep learning-based test case generation approaches are trained with a single learning objective: to generate test cases that exactly match the ground-truth test cases. This objective may limit the model's ability to generate different yet correct test cases. In this paper, we introduce PyTester, a Text-to-Testcase generation approach that can automatically generate syntactically correct, executable, complete, and effective test cases that are aligned with a given natural language requirement. We evaluate PyTester on the public APPS benchmark dataset, and the results show that our Deep RL approach enables PyTester, a small language model, to outperform much larger language models such as GPT-3.5, StarCoder, and InCoder. Our findings suggest that future research could focus on improving small LMs rather than large ones, for better resource efficiency, by integrating software engineering (SE) domain knowledge into the design of the reinforcement learning architecture.
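To make concrete the idea of rewarding a generated test case for properties other than exact match with a ground-truth test, here is a minimal Python sketch of a staged reward. It is not PyTester's actual reward function: the name `staged_reward`, the numeric reward values, the reading of "complete" as "contains at least one assert", and the reading of "effective" as "passes against a reference solution" are all assumptions made for illustration.

```python
import ast


def staged_reward(test_code: str, solution_code: str) -> float:
    """Hypothetical staged reward for a generated test case.

    All reward values below are illustrative assumptions, not values
    taken from the paper.
    """
    # Stage 1: the test must be syntactically valid Python.
    try:
        tree = ast.parse(test_code)
    except SyntaxError:
        return -1.0

    # Stage 2 (assumed notion of "complete"): at least one assert statement.
    if not any(isinstance(node, ast.Assert) for node in ast.walk(tree)):
        return -0.5

    # Stage 3: run the test against a reference solution.
    # NOTE: exec() on untrusted model output is unsafe and has no timeout;
    # a real setup would run this in a sandboxed subprocess.
    namespace = {}
    try:
        exec(solution_code, namespace)
        exec(test_code, namespace)
    except AssertionError:
        return 0.0   # runs, but the asserted expectation is wrong
    except Exception:
        return -0.3  # valid syntax, but fails at run time
    return 1.0       # runs and passes against the reference solution


if __name__ == "__main__":
    solution = "def add(a, b):\n    return a + b\n"
    print(staged_reward("assert add(1, 2) == 3", solution))  # 1.0
    print(staged_reward("assert add(1, 2) == 4", solution))  # 0.0
    print(staged_reward("assert add(1, 2 ==", solution))     # -1.0
```

Under a reward of this shape, any test that is valid, runs, and passes receives full credit, including a test that differs from the ground-truth test. That is the flexibility an exact-match training objective lacks.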

