When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply relying on their familiarity with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") in which relative clauses nest recursively, creating processing demands that range from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with median gaps of up to 26.8 percentage points, quantifying when models abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy, but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models, whose plausibility advantage widens systematically with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
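To make the recursive nesting concrete, the following is a minimal illustrative sketch of how a center-embedded sentence of arbitrary depth can be assembled, together with the who-did-what-to-whom dependencies that a syntactic question would probe. It is not the CenterBench generation pipeline: the function names `center_embed` and `dependencies` and the assumption that every noun takes the article "the" are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple


def center_embed(nouns: List[str], verbs: List[str]) -> str:
    """Build a center-embedded sentence (hypothetical helper).

    verbs[i] is the verb whose subject is nouns[i]. Subjects appear in
    order, while their verbs appear in reverse order, so the outermost
    subject is the one farthest from its verb.
    """
    if not nouns or len(nouns) != len(verbs):
        raise ValueError("need equally many nouns and verbs, at least one each")
    subject_chain = " that ".join(f"the {noun}" for noun in nouns)
    verb_chain = " ".join(reversed(verbs))
    sentence = f"{subject_chain} {verb_chain}."
    return sentence[0].upper() + sentence[1:]


def dependencies(
    nouns: List[str], verbs: List[str]
) -> List[Tuple[str, str, Optional[str]]]:
    """Return (agent, verb, patient) triples implied by the structure.

    The main-clause verb has no patient here; each embedded verb acts
    on the noun one level above it.
    """
    triples: List[Tuple[str, str, Optional[str]]] = []
    for i, (noun, verb) in enumerate(zip(nouns, verbs)):
        patient = nouns[i - 1] if i > 0 else None
        triples.append((noun, verb, patient))
    return triples


if __name__ == "__main__":
    print(center_embed(["cat", "dog"], ["meowed", "chased"]))
    print(center_embed(["cat", "dog", "man"], ["meowed", "chased", "owned"]))
    print(dependencies(["cat", "dog", "man"], ["meowed", "chased", "owned"]))
```

Because the triples are derived purely from position, the same code yields the correct answers for an implausible sentence as for a plausible one, which is the property the plausible/implausible contrast relies on.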