Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem

Property-based testing (PBT) is a lightweight formal method, typically implemented as a randomized testing framework. Users specify the input domain for their test using combinators supplied by the PBT framework, and express the expected properties or invariants as a unit-test function. The framework then searches for a counterexample, for example by generating inputs and calling the test function. In this work, we demonstrate an LLM-based agent which analyzes Python modules, infers function-specific and cross-function properties from code and documentation, synthesizes and executes PBTs, reflects on the outputs of these tests to confirm true bugs, and finally outputs actionable bug reports for the developer. We perform an extensive evaluation of our agent across 100 popular Python packages. Of the bug reports generated by the agent, we found after manual review that 56% were valid bugs and 32% were valid bugs that we would report to maintainers. We then developed a ranking rubric to surface high-priority valid bugs to developers, and found that of the 21 top-scoring bugs, 86% were valid and 81% we would report. The bugs span diverse failure modes, from serialization failures to numerical precision errors to flawed cache implementations. We reported 5 bugs, 4 with patches, including to NumPy and cloud computing SDKs, with 3 patches merged successfully. Our results suggest that LLMs combined with PBT provide a rigorous and scalable method for autonomously testing software. Our code and artifacts are available at: https://github.com/mmaaz-git/agentic-pbt.
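To make the PBT workflow described above concrete, here is a minimal, standard-library-only sketch of its three ingredients: a generator combinator that defines the input domain, a property written as an ordinary function, and a randomized search for a counterexample. It is an illustration only and is not the agent or the framework used in this work. All names (`lists_of`, `small_ints`, `find_counterexample`, `buggy_dedupe`) are hypothetical, and real PBT frameworks add features this sketch omits, such as shrinking counterexamples to a minimal form.

```python
import random
from typing import Callable, Optional


def small_ints(rng: random.Random) -> int:
    """Base generator: a small integer, so collisions are likely."""
    return rng.randint(-5, 5)


def lists_of(element: Callable[[random.Random], int], max_size: int = 8):
    """Combinator: build a list generator from an element generator."""
    def gen(rng: random.Random) -> list:
        return [element(rng) for _ in range(rng.randint(0, max_size))]
    return gen


def find_counterexample(gen, prop, trials: int = 500, seed: int = 0) -> Optional[list]:
    """Randomized search: return the first input that falsifies prop, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = gen(rng)
        if not prop(value):
            return value
    return None


def buggy_dedupe(xs: list) -> list:
    """Hypothetical function under test: only drops *adjacent* duplicates."""
    out = []
    for x in xs:
        if not out or out[-1] != x:
            out.append(x)
    return out


def sort_is_idempotent(xs: list) -> bool:
    """Property expected to hold: sorting twice equals sorting once."""
    return sorted(sorted(xs)) == sorted(xs)


def dedupe_has_no_duplicates(xs: list) -> bool:
    """Property expected to fail: the result should contain no repeats."""
    result = buggy_dedupe(xs)
    return len(result) == len(set(result))


if __name__ == "__main__":
    domain = lists_of(small_ints)
    print("sort property:", find_counterexample(domain, sort_is_idempotent))
    print("dedupe property:", find_counterexample(domain, dedupe_has_no_duplicates))
```

The search returns `None` for the sorting property and a falsifying list for the deduplication property. A falsifying input of this kind is what a bug report would be built around.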
