Understanding how and why large language models (LLMs) fail is becoming a central challenge as models evolve rapidly and static evaluations fall behind. Although dynamic test generation has enabled automated probing, existing approaches often discover only isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search that explicitly allocates a limited probing budget between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.
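The budget allocation described above, between global exploration of new failure regions and local refinement of recurring error patterns, can be illustrated with a small tree search. The abstract does not state which selection rule ProbeLLM uses, so the sketch below assumes the standard UCB1 rule common in Monte Carlo Tree Search. The names `Node`, `ucb1`, `select_path`, `run_probing`, and the `probe` callback are hypothetical and introduced only for illustration. Test generation and tool-augmented verification are replaced by a stand-in callback that reports whether a verified failure was found.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """A region of the probing space (hypothetical structure, not from the paper)."""
    name: str
    children: list[Node] = field(default_factory=list)
    visits: int = 0
    failures: float = 0.0  # cumulative reward: 1.0 per verified failure


def ucb1(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """UCB1 score (assumed rule): observed failure rate plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited regions are always tried first
    mean = child.failures / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_path(root: Node) -> list[Node]:
    """Descend from the root, taking the highest-scoring child at each level."""
    path = [root]
    node = root
    while node.children:
        parent_visits = max(node.visits, 1)
        node = max(node.children, key=lambda ch: ucb1(ch, parent_visits))
        path.append(node)
    return path


def run_probing(root: Node, probe: Callable[[str], bool], budget: int) -> None:
    """Spend `budget` probes; `probe(name)` returns True on a verified failure."""
    for _ in range(budget):
        path = select_path(root)
        reward = 1.0 if probe(path[-1].name) else 0.0
        for node in path:  # back-propagate the outcome along the selected path
            node.visits += 1
            node.failures += reward
```

In this sketch the upper level of the tree plays the role of global exploration (choosing among coarse regions), while deeper levels play the role of local refinement (concentrating probes on sub-regions where failures recur). The constant `c` controls how much budget goes to rarely visited regions and is an arbitrary choice here.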