When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes, suggesting that thinking changes the pattern of errors (some prompts improve while others worsen) rather than uniformly degrading performance. Under a post-hoc Qwen3-derived grouping, constraint types separate into Planning (global counting, structure, coordination), which improves at the class level under thinking, and Precision (exact local form), which consistently worsens; the class-level Planning/Precision sign pattern holds directionally for all four Hunyuan models despite Hunyuan's opposite aggregate direction. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains. Analyzing thinking traces with a cross-encoder (CE) relevance metric reveals three patterns: Neutral shows a positive relevance-compliance link (r ≈ 0.15); Planning shows near-zero predictive correlation (r ≈ 0.02) despite measurable trace engagement, consistent with an execution gap between CE-measured trace relevance and final-answer compliance; Precision shows a small negative correlation (r ≈ -0.05), with failing instances having higher mean relevance than passing ones. Activation patching across four model sizes (1.7B-14B) shows that Precision flip instances are more often restored than Planning flip instances (mean layer-restoration rates of 32-58% vs. 14-40%), with the largest gap at 14B (about 30 pp).
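The contrast between a small aggregate change and a large share of switching prompts can be made concrete with a minimal sketch. It assumes per-prompt pass/fail outcomes are available for both modes; the function name `paired_mode_stats` and the toy outcomes are hypothetical illustrations, not the evaluation code or data behind the reported numbers.

```python
from typing import Dict, Sequence


def paired_mode_stats(pass_off: Sequence[bool], pass_on: Sequence[bool]) -> Dict[str, float]:
    """Compare per-prompt outcomes for Thinking OFF vs. Thinking ON.

    Hypothetical helper: returns the net pass-rate change (percentage
    points) and the share of prompts whose outcome flips between modes.
    """
    if len(pass_off) != len(pass_on) or len(pass_off) == 0:
        raise ValueError("need two non-empty outcome lists of equal length")
    n = len(pass_off)
    # Prompts that fail with thinking off but pass with thinking on.
    gained = sum(1 for off, on in zip(pass_off, pass_on) if not off and on)
    # Prompts that pass with thinking off but fail with thinking on.
    lost = sum(1 for off, on in zip(pass_off, pass_on) if off and not on)
    return {
        "net_change_pp": 100.0 * (gained - lost) / n,
        "flip_rate_pct": 100.0 * (gained + lost) / n,
        "gained": float(gained),
        "lost": float(lost),
    }


# Toy outcomes for 20 prompts (invented for illustration only).
pass_off = [True] * 12 + [False] * 8
pass_on = [True] * 10 + [False] * 2 + [True] * 1 + [False] * 7

stats = paired_mode_stats(pass_off, pass_on)
print(stats)
```

In this toy case two prompts are lost and one is gained, so the net change is only -5 pp while 15% of prompts flip: gains and losses largely cancel in the aggregate, which is why the aggregate alone can understate how much the error pattern moves.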
