The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts

When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") in which relative clauses nest recursively, creating processing demands that range from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with median gaps of up to 26.8 percentage points, quantifying when models abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy, but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models, whose plausibility advantage widens systematically with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
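The plausibility gap described above can be illustrated with a minimal sketch. It assumes each graded answer is available as a (depth, condition, correct) tuple; that record layout, the function name `plausibility_gaps`, and the toy records are hypothetical and are not taken from CenterBench or its results.

```python
from collections import defaultdict


def plausibility_gaps(records):
    """Return {depth: plausible accuracy minus implausible accuracy}, in percentage points.

    `records` is an iterable of (depth, condition, correct) tuples, where
    condition is "plausible" or "implausible". This layout is an assumption
    made for illustration only.
    """
    # counts[depth][condition] holds [number correct, number answered]
    counts = defaultdict(lambda: {"plausible": [0, 0], "implausible": [0, 0]})
    for depth, condition, correct in records:
        cell = counts[depth][condition]
        cell[0] += int(correct)
        cell[1] += 1

    gaps = {}
    for depth in sorted(counts):
        cells = counts[depth]
        # Skip depths that lack answers in either condition.
        if cells["plausible"][1] == 0 or cells["implausible"][1] == 0:
            continue
        acc_plausible = 100.0 * cells["plausible"][0] / cells["plausible"][1]
        acc_implausible = 100.0 * cells["implausible"][0] / cells["implausible"][1]
        gaps[depth] = acc_plausible - acc_implausible
    return gaps


# Toy records (made up, not results from the paper).
toy_records = [
    (1, "plausible", True), (1, "plausible", True),
    (1, "implausible", True), (1, "implausible", True),
    (3, "plausible", True), (3, "plausible", True),
    (3, "implausible", True), (3, "implausible", False),
]
gaps = plausibility_gaps(toy_records)
```

A gap near zero at a given depth would suggest that plausibility is not driving the answers, while a gap that grows with depth corresponds to the widening pattern the abstract reports for models.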
