Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While dynamic test generation has enabled automated probing, existing approaches often discover only isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating a limited probing budget between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.
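To make the exploration/refinement trade-off described above concrete, the sketch below shows a minimal, UCT-style tree search that splits a probing budget between visiting new regions of the probe space and revisiting regions where failures keep recurring. This is an illustrative assumption about how such a search could be organized, not ProbeLLM's actual implementation; `generate_case` and `run_and_verify` are hypothetical placeholders for the tool-augmented generator and the programmatic verifier mentioned in the abstract.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """A region of the probe space: children of the root are broad failure
    regions (global level); their children are recurring error patterns
    (local level)."""
    name: str
    visits: int = 0
    failures: int = 0  # probes in this region that the target model failed
    children: list = field(default_factory=list)

    def failure_rate(self) -> float:
        return self.failures / self.visits if self.visits else 0.0

def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: exploit regions with high observed failure rates while
    still exploring rarely visited ones."""
    if child.visits == 0:
        return float("inf")
    return child.failure_rate() + c * math.sqrt(math.log(parent_visits) / child.visits)

def select(root: Node) -> list[Node]:
    """Descend from the root, picking the highest-UCT child at each level;
    return the full path so statistics can be back-propagated."""
    path, node = [root], root
    while node.children:
        node = max(node.children, key=lambda ch: uct_score(ch, node.visits))
        path.append(node)
    return path

def probe_once(root: Node, generate_case, run_and_verify):
    """One probing step: select a region, generate a verifiable test case in
    it, check the target model's answer, and back-propagate the outcome."""
    path = select(root)
    case = generate_case(path[-1].name)   # hypothetical tool-augmented generator
    failed = run_and_verify(case)         # hypothetical programmatic verifier
    for node in path:
        node.visits += 1
        node.failures += int(failed)
    return case, failed
```

Under this framing, regions whose observed failure rate stays high attract repeated local refinement, while the exploration term keeps a share of the budget flowing to untested regions, matching the global-versus-local allocation the abstract describes.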