We introduce ABD, a benchmark for default-exception abduction over finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must output a first-order formula that defines exceptions, restoring satisfiability while keeping exceptions sparse. We formalize three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification. Evaluating ten frontier LLMs on 600 instances, the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.
翻译:本文提出ABD,一个面向有限一阶世界中默认-异常溯因任务的基准测试。给定包含异常谓词的背景理论及一组关系结构,模型需输出一个定义异常情况的一阶公式,在恢复可满足性的同时保持异常描述的稀疏性。我们通过精确的SMT验证形式化了三种观测机制(封闭世界、存在性补全、全称补全)。在600个测试实例上对十个前沿大语言模型进行评估,最佳模型虽能实现较高的有效性,但简洁性仍存在差距;留出评估进一步揭示了不同观测机制下各异的泛化失效模式。