PBT-Bench: Benchmarking AI Agents on Property-Based Testing

Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that fixes a described issue. Neither isolates the distinct skill of property-based testing: deriving a semantic invariant from documentation and then constructing an input-generation strategy precise enough for random search to reveal the violation. We introduce PBT-Bench, a benchmark of 100 curated property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem) designed so that default-strategy random inputs almost never trigger them; the agent must read the library's documentation, identify the relevant invariant, and specify a Hypothesis @given strategy that concentrates probability mass in the trigger region. Bugs are stratified across three difficulty levels (L1-L3), ranging from single-constraint boundary bugs to stateful, cross-function protocol violations. We evaluate eight contemporary LLMs under two prompting regimes (open-ended baseline vs. explicit Hypothesis scaffolding), with three independent runs per configuration. Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, it ranges from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points but yields smaller gains for the strongest models, and two models degrade under it, suggesting that the structured prompt can interfere with certain model behaviours rather than complement them. The hardest bugs prove model-specific: different architectures fail on different problems, leaving persistent gaps that no single model closes. We release the benchmark, harness, and full evaluation corpus to support downstream work on documentation-grounded semantic reasoning.
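
To illustrate why a default input distribution can almost never reach a boundary bug while a targeted one reaches it readily, the following is a minimal sketch and not part of the benchmark itself. It uses only the standard-library random module as a stand-in for a Hypothesis strategy; buggy_clamp, its injected bug, and both generators are hypothetical examples invented here. In Hypothesis, the targeted generator would presumably be written as a strategy passed to @given.

```python
import random

BOUND = 10**9  # assumed input range for this illustration


def buggy_clamp(x, lo, hi):
    """Hypothetical library function with one injected boundary bug."""
    if x == hi:
        return hi - 1  # injected bug: wrong only when x sits exactly on hi
    return min(max(x, lo), hi)


def violates_property(x, lo, hi):
    """Invariant: in-range values are unchanged; all results lie in [lo, hi]."""
    out = buggy_clamp(x, lo, hi)
    if lo <= x <= hi:
        return out != x
    return not (lo <= out <= hi)


def default_inputs(rng):
    """Stand-in for an unconstrained strategy: three independent integers."""
    a, b, x = (rng.randint(-BOUND, BOUND) for _ in range(3))
    return x, min(a, b), max(a, b)


def targeted_inputs(rng):
    """Stand-in for a tailored strategy: x is often placed on a boundary."""
    a, b = rng.randint(-BOUND, BOUND), rng.randint(-BOUND, BOUND)
    lo, hi = min(a, b), max(a, b)
    x = rng.choice([lo, hi, rng.randint(lo, hi)])
    return x, lo, hi


def count_violations(generator, trials=1000, seed=0):
    """Count how many generated inputs expose the injected bug."""
    rng = random.Random(seed)
    return sum(violates_property(*generator(rng)) for _ in range(trials))


default_hits = count_violations(default_inputs)
targeted_hits = count_violations(targeted_inputs)
print("default generator violations: ", default_hits)
print("targeted generator violations:", targeted_hits)
```

The two generators check the same invariant; only the input distribution differs. Under these assumptions, the default generator needs x to coincide with hi by chance, whereas the targeted one places x on a boundary in roughly a third of its draws or more.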
