TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation

Large Language Models (LLMs) are increasingly applied to automate software development, where comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. They either treat existing code as ground truth for regression prevention (a compliance trap) or depend on post-failure artifacts such as issue reports for bug reproduction, so they rarely surface defects before failures occur. To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments. TestExplora contains 2,389 tasks from 482 repositories and hides all defect-related signals. Models must find bugs proactively by comparing implementations against the intent derived from documentation, which serves as the test oracle. Furthermore, to keep evaluation sustainable and reduce data leakage, we propose continuous, time-aware data collection. Our evaluation reveals a significant capability gap: state-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of only 16.06%. Further analysis indicates that navigating complex cross-module interactions and leveraging agentic exploration are critical to advancing LLMs toward autonomous software quality assurance. Consistent with this, SWEAgent instantiated with GPT-5-mini achieves an F2P of 17.27% and an F2P@5 of 29.7%, highlighting the effectiveness and promise of agentic exploration in proactive bug discovery tasks.
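
The abstract reports F2P and F2P@5 without defining them. The sketch below shows one plausible way such rates could be computed, assuming the common convention that a generated test counts as Fail-to-Pass when it fails on the defective code and passes on the fixed code, and that F2P@k counts a task as solved when any of its first k attempts is Fail-to-Pass. The `Attempt` record, its field names, and the `f2p_rate` helper are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Attempt:
    """Outcome of one generated test for one task (hypothetical record)."""
    passes_on_buggy: bool  # result of running the test on the defective code
    passes_on_fixed: bool  # result of running the test on the fixed code


def is_f2p(attempt: Attempt) -> bool:
    """Assumed convention: the test fails before the fix and passes after it."""
    return (not attempt.passes_on_buggy) and attempt.passes_on_fixed


def f2p_rate(attempts_by_task: Dict[str, List[Attempt]], k: int = 1) -> float:
    """Fraction of tasks where any of the first k attempts is Fail-to-Pass."""
    if not attempts_by_task:
        return 0.0
    solved = sum(
        any(is_f2p(a) for a in attempts[:k])
        for attempts in attempts_by_task.values()
    )
    return solved / len(attempts_by_task)


# Made-up outcomes for four tasks, for illustration only.
example = {
    "task-a": [Attempt(False, True)],                        # F2P on first try
    "task-b": [Attempt(True, True), Attempt(False, True)],   # F2P on second try
    "task-c": [Attempt(True, True)],                         # test never exposes the bug
    "task-d": [Attempt(False, False)],                       # test is broken on both versions
}
f2p_at_1 = f2p_rate(example, k=1)
f2p_at_5 = f2p_rate(example, k=5)
```

Under this reading, a test that passes on both versions misses the defect, and a test that fails on both versions is simply invalid, so neither counts toward the rate.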
