We introduce ABD, a benchmark for default-exception abduction over finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must output a first-order formula that defines exceptions, restoring satisfiability while keeping exceptions sparse. We formalize three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification. Evaluating ten frontier LLMs on 600 instances, the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.
翻译:我们提出ABD,这是一个针对有限一阶世界上的缺省异常溯因任务的基准测试。给定一个包含异常谓词和一组关系结构的背景理论,模型需要输出一个定义异常的一阶公式,在保持异常稀疏性的同时恢复可满足性。我们形式化了三种观察范式(封闭世界、存在完成、全称完成),并采用精确的SMT验证。在600个实例上对十种前沿大语言模型进行评估,最优模型虽能达到高有效性,但简约性缺口依然存在,且留出评估揭示了不同范式间截然不同的泛化失效模式。