One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness under trivial constraints? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14--48\% of comprehensiveness across seven models spanning five families (7B--70B, open- and closed-weight). A blinded human evaluation with 10 STEM-trained evaluators confirms genuine content loss: information criteria degrade $1.5$--$2.3\times$ more than surface criteria, a finding corroborated by over 4,100 automated pairwise comparisons (77--100\% baseline preference) across three LLM judges from two model families. Diagnostic analysis identifies this as a \emph{planning failure}: two-pass generation recovers 59--96\% of response length, and linear probes on prompt representations predict response length with $R^2 = 0.51$--$0.94$ before generation begins. The same probes yield negative $R^2$ on base models, confirming that instruction tuning introduces the representational structure underlying the collapse. Base models show no systematic degradation under identical constraints, demonstrating that instruction tuning couples task competence to narrow surface-form templates. The effect extends to realistic deployment constraints (preamble suppression, corporate tone guidelines, legal compliance hedging, accessibility requirements), which cause comparable degradation ($-$22\% to $-$34\%); suppressing the conversational opener alone (``Certainly!'') causes a 40\% collapse on our most fragile model despite restricting only the opening tokens. We further show that standard independent LLM-as-judge evaluation detects only a 3.5\% quality drop where pairwise evaluation reveals 23\%, exposing a methodological blind spot in current evaluation practice.
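To make the probing claim concrete, the following is a minimal sketch of how a linear probe and its held-out $R^2$ could be computed. It is an illustration only, not the authors' pipeline: the ridge regularizer, the helper names (`fit_ridge_probe`, `r_squared`, `probe_r2`), and the synthetic stand-ins for prompt representations and response lengths are all assumptions. It also shows why $R^2$ can be negative: the score drops below zero whenever the probe predicts worse than the constant mean of the targets.

```python
import numpy as np


def fit_ridge_probe(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    X: (n, d) feature matrix (hypothetical stand-in for prompt representations).
    y: (n,) targets (hypothetical stand-in for response lengths).
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def probe_r2(X, y, n_train, alpha=1.0):
    """Fit the probe on the first n_train rows and score it on the rest."""
    w, b = fit_ridge_probe(X[:n_train], y[:n_train], alpha)
    return r_squared(y[n_train:], X[n_train:] @ w + b)


# Synthetic data only: no real model activations are used here.
rng = np.random.default_rng(0)
n, d = 400, 32
X = rng.normal(size=(n, d))
direction = rng.normal(size=d)

# Case 1: the target is linearly encoded in the features (plus noise).
y_structured = X @ direction + 0.5 * rng.normal(size=n)
# Case 2: the target is unrelated to the features.
y_unrelated = rng.normal(size=n)

r2_structured = probe_r2(X, y_structured, n_train=300)
r2_unrelated = probe_r2(X, y_unrelated, n_train=300)

# A probe that predicts worse than the target mean scores below zero.
r2_reversed = r_squared(np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0]))
```

In this sketch, `r2_structured` is high because the target is a linear function of the features, `r2_unrelated` stays close to zero (and is often slightly negative on held-out data), and `r2_reversed` is a hand-built case where the predictions are worse than the mean baseline.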
