We introduce ABD, a benchmark for default-exception abduction over finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must output a first-order formula that defines exceptions, restoring satisfiability while keeping exceptions sparse. We formalize three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification. Evaluating ten frontier LLMs on 600 instances, the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.
翻译:我们提出了ABD,一个面向有限一阶世界中默认例外溯因的基准测试集。给定一个包含异常谓词和一组关系结构的背景理论,模型需输出一个定义例外的一阶公式,在保持例外稀疏性的同时恢复可满足性。我们形式化了三种观测模式(封闭世界、存在完备化、全称完备化),并采用精确的SMT验证。对十个前沿LLM在600个实例上的评估显示,最佳模型虽能达到高有效性,但简约性差距依然存在,且留出评估揭示了不同观测模式下截然不同的泛化失败模式。