Large Language Models (LLMs) are increasingly applied to automate software development, where comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. Specifically, they either treat existing code as ground truth for regression prevention (a compliance trap) or depend on post-failure artifacts (e.g., issue reports) for bug reproduction, so they rarely surface defects before failures occur. To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments. TestExplora contains 2,389 tasks from 482 repositories and hides all defect-related signals. Models must proactively find bugs by comparing implementations against documentation-derived intent, using documentation as the oracle. Furthermore, to keep evaluation sustainable and reduce leakage, we propose continuous, time-aware data collection. Our evaluation reveals a significant capability gap: state-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of only 16.06%. Further analysis indicates that navigating complex cross-module interactions and leveraging agentic exploration are critical to advancing LLMs toward autonomous software quality assurance. Consistent with this, SWEAgent instantiated with GPT-5-mini achieves an F2P of 17.27% and an F2P@5 of 29.7%, highlighting the effectiveness and promise of agentic exploration in proactive bug discovery tasks.