Property-based testing (PBT), while an established technique in the software testing research community, is still relatively underused in real-world software. Pain points in writing property-based tests include implementing diverse random input generators and thinking of meaningful properties to test. Developers, however, are more amenable to writing documentation; plenty of library API documentation is available and can be used as natural language specifications for property-based tests. As large language models (LLMs) have recently shown promise in a variety of coding tasks, we explore the potential of using LLMs to synthesize property-based tests. We call our approach PBT-GPT, and propose three different strategies of prompting the LLM for PBT. We characterize various failure modes of PBT-GPT and detail an evaluation methodology for automatically synthesized property-based tests. PBT-GPT achieves promising results in our preliminary studies on sample Python library APIs in $\texttt{numpy}$, $\texttt{networkx}$, and $\texttt{datetime}$.
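To make the two pain points named above concrete (writing a random input generator and choosing a meaningful property), the following is a minimal hand-written sketch of a property-based test against the standard-library `datetime` module. It is not output from PBT-GPT and does not reproduce the paper's prompts or results. The helper names `random_date`, `prop_isoformat_round_trip`, `prop_always_weekday`, and `check_property` are hypothetical and introduced only for illustration; a dedicated PBT library would normally supply the generator and the checking loop.

```python
import random
from datetime import date


def random_date(rng):
    """Generator (hypothetical helper): draw a date uniformly from the
    full range supported by datetime.date."""
    return date.fromordinal(
        rng.randint(date.min.toordinal(), date.max.toordinal())
    )


def prop_isoformat_round_trip(d):
    """Property: parsing the ISO 8601 string of a date yields the same date."""
    return date.fromisoformat(d.isoformat()) == d


def prop_always_weekday(d):
    """Deliberately false property: claims every date is Monday to Friday."""
    return d.weekday() < 5


def check_property(prop, gen, trials=1000, seed=0):
    """Run `prop` on `trials` generated inputs.

    Returns the first counterexample found, or None if every trial passed.
    """
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(trials):
        value = gen(rng)
        if not prop(value):
            return value
    return None


if __name__ == "__main__":
    print("round trip:", check_property(prop_isoformat_round_trip, random_date))
    print("always weekday:", check_property(prop_always_weekday, random_date))
```

The round-trip property is the kind of statement that can be read off API documentation, while the deliberately false property shows how random inputs surface a counterexample when a property does not hold.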