One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness when trivially constrained? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14--48% of comprehensiveness in pairwise evaluation across three open-weight model families and one closed-weight model (GPT-4o-mini). The baseline response is preferred in 77--100% of 1,920 pairwise comparisons judged by GPT-4o-mini and GPT-4o. Notably, GPT-4o-mini suffers 31% comprehensiveness loss (99% baseline win rate), demonstrating that the fragility extends to commercially deployed closed-weight models, contrary to prior findings on format-level constraints. Through mechanistic analysis, we identify this as a planning failure: two-pass generation (free generation followed by constrained rewriting) recovers 59--96% of response length, and linear probes on prompt representations predict response length with $R^2 = 0.51$--$0.93$ before generation begins, with $R^2$ tracking collapse severity across models. The same probes yield negative $R^2$ on base models, confirming that instruction tuning creates the representational structure encoding the collapse decision. Crucially, base models show no systematic collapse under identical constraints, with effects that are small, noisy, and bidirectional, demonstrating that instruction tuning creates this fragility by coupling task competence to narrow surface-form templates. The effect replicates on MT-Bench across all eight task categories. We further show that standard independent LLM-as-judge evaluation detects only a 3.5% average quality drop where pairwise evaluation reveals 23%, exposing a methodological blind spot in how constrained generation is assessed.
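
To make the probe result concrete, the following is a minimal sketch of how a held-out $R^2$ for a linear probe could be computed, and why it can be negative. It is not the authors' pipeline: the ridge regularizer, the train/test split, the synthetic features standing in for prompt representations, and all function names (`fit_ridge_probe`, `r_squared`, `probe_r2`) are assumptions made for illustration. The point is only that $R^2 = 1 - SS_{res}/SS_{tot}$ drops below zero whenever the probe predicts held-out targets worse than the constant mean predictor.

```python
import numpy as np

def fit_ridge_probe(X, y, alpha=1.0):
    """Fit a ridge-regularized linear probe (assumed setup); returns (weights, bias)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Closed-form ridge solution on centered data.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def probe_r2(X, y, n_train, alpha=1.0):
    """Fit on the first n_train rows, score R^2 on the remaining rows."""
    w, b = fit_ridge_probe(X[:n_train], y[:n_train], alpha)
    return r_squared(y[n_train:], X[n_train:] @ w + b)

# Synthetic stand-ins (hypothetical): 400 "prompts" with 32-dim representations.
rng = np.random.default_rng(0)
n, d, n_train = 400, 32, 300
X = rng.normal(size=(n, d))

# Case 1: the target (e.g. response length) is linearly encoded in X.
y_encoded = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
# Case 2: the target is unrelated to X, so the probe can only fit noise.
y_unrelated = rng.normal(size=n)

r2_encoded = probe_r2(X, y_encoded, n_train)
r2_unrelated = probe_r2(X, y_unrelated, n_train)

# Hand-checkable negative case: SS_res = 8, SS_tot = 2, so R^2 = 1 - 4 = -3.
r2_reversed = r_squared(np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0]))

print(f"encoded target:   R^2 = {r2_encoded:.3f}")
print(f"unrelated target: R^2 = {r2_unrelated:.3f}")
print(f"reversed preds:   R^2 = {r2_reversed:.1f}")
```

In this toy setup the encoded target yields a high held-out $R^2$, while the unrelated target yields a value near or below zero because the probe fits noise that does not transfer to held-out rows. Under that reading, a negative $R^2$ indicates that the representations carry no linearly decodable signal for the target.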
