The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

Large language models fail when a salient surface cue conflicts with an unstated feasibility constraint. We introduce the Heuristic Override Benchmark (HOB): 500 instances spanning 4 heuristic families and 5 constraint families, with minimal pairs and explicitness gradients. We pair HOB with a falsifiable behavioral characterization that follows a diagnose-measure-bridge-treat arc. Causal-behavioral analysis of the car wash problem across six models reveals context-independent sigmoid heuristics: the distance cue has 8.7 to 38 times more influence than the goal, and attribution matches keyword association better than compositional inference. Across 14 models under strict 10/10 evaluation, no model exceeds 75%, and presence constraints are the hardest family at 44%. A minimal hint improves performance by 15 pp, which points to a constraint-inference failure rather than missing knowledge. However, 12 of 14 models perform worse when the constraint is removed, by up to 39 pp, revealing a conservative bias. In a thinking-mode ablation on Gemini 3.1 Pro, performance drops from 74.6% with thinking on to 58.4% with thinking off, while explicit goal decomposition recovers it to 71.2%. Internal deliberation therefore does useful work, and explicit prompting can partially substitute for it. Reasoning models do not categorically outperform non-reasoning peers: after controlling for capability rank, the residual reasoning-mode effect is 1.8 pp and is not significant. Parametric probes show that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics. Goal-decomposition prompting improves performance by 5.0 pp, compared with 3.1 pp for generic chain-of-thought, isolating constraint enumeration as the active ingredient. Overall, heuristic override is a systematic reasoning vulnerability with a quantified locus in inference order rather than knowledge, and with a tested intervention.
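To make the "context-independent sigmoid heuristic" concrete, the following is a minimal illustrative sketch, not the paper's fitted model. It assumes a toy logistic response in which the probability of answering "walk" depends strongly on distance and only weakly on whether the goal requires the car. The function names and all parameter values (`midpoint_km`, `steepness`, `goal_weight`) are hypothetical and chosen only for illustration.

```python
import math


def p_walk(distance_km, goal_needs_car,
           midpoint_km=1.0, steepness=4.0, goal_weight=0.3):
    """Toy logistic model of P(model answers "walk").

    All parameter defaults are hypothetical, not values from the paper.
    The distance term dominates the logit; the goal term shifts it slightly.
    """
    goal_term = goal_weight if goal_needs_car else 0.0
    logit = steepness * (midpoint_km - distance_km) - goal_term
    return 1.0 / (1.0 + math.exp(-logit))


def distance_effect(goal_needs_car, near_km=0.1, far_km=5.0):
    """Change in P(walk) when only the distance cue moves from near to far."""
    return p_walk(near_km, goal_needs_car) - p_walk(far_km, goal_needs_car)


def goal_effect(distance_km):
    """Change in P(walk) when only the goal flips (a minimal pair)."""
    return p_walk(distance_km, False) - p_walk(distance_km, True)


if __name__ == "__main__":
    # Minimal pair at a short distance: the goal requires the car in one case only.
    print("P(walk | near, goal needs car):", round(p_walk(0.1, True), 3))
    print("P(walk | near, goal is neutral):", round(p_walk(0.1, False), 3))
    print("distance effect:", round(distance_effect(True), 3))
    print("goal effect:", round(goal_effect(0.1), 3))
```

In this toy setting the model still says "walk" with high probability even when the goal makes walking infeasible, because the goal barely moves the curve. Comparing `distance_effect` with `goal_effect` is one simple way to express the kind of cue-versus-goal influence ratio the abstract reports.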
