The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts

When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed"), in which relative clauses nest recursively, creating processing demands that range from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with median gaps of up to 26.8 percentage points, quantifying when models abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy, but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models, whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
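To make the two central notions concrete, the following is a minimal sketch of how a center-embedded sentence can be assembled recursively and how a plausibility gap in percentage points can be computed. It is an illustration only, not the CenterBench generation or scoring code: the function names `center_embed` and `plausibility_gap`, the sentence template, and the sample inputs are all assumptions introduced here.

```python
def center_embed(nouns, verbs):
    """Build a center-embedded sentence (hypothetical template).

    nouns[0] is the outermost subject and verbs[i] is the action
    performed by nouns[i]. Subjects are emitted outermost-first and
    verbs innermost-first, which produces the nested dependency pattern.
    """
    if not nouns or len(nouns) != len(verbs):
        raise ValueError("need one verb per noun, and at least one noun")
    # "the cat that the dog that the boy"
    subjects = " that ".join(f"the {noun}" for noun in nouns)
    # The innermost clause closes first, so the verbs are reversed.
    predicate = " ".join(reversed(verbs))
    sentence = f"{subjects} {predicate}."
    return sentence[0].upper() + sentence[1:]


def plausibility_gap(plausible_correct, implausible_correct):
    """Accuracy on plausible items minus accuracy on implausible items,
    in percentage points. Each argument is a non-empty list of booleans
    (True = question answered correctly)."""
    def accuracy(results):
        return 100.0 * sum(results) / len(results)
    return accuracy(plausible_correct) - accuracy(implausible_correct)


# One level of embedding, matching the example in the abstract.
print(center_embed(["cat", "dog"], ["meowed", "chased"]))
# Two levels of embedding (invented example).
print(center_embed(["cat", "dog", "boy"], ["meowed", "chased", "owned"]))
# Made-up results: 4/4 correct on plausible items, 2/4 on implausible ones.
print(plausibility_gap([True] * 4, [True, False, True, False]))
```

A positive gap means the model does better when the sentence matches familiar semantic associations; under the abstract's framing, a gap that grows with embedding depth is the signal that structure is being traded for shortcuts.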

