Intention Collapse: Intention-Level Metrics for Reasoning in Language Models

Language generation maps a rich, high-dimensional internal state to a single token sequence. We study this many-to-one mapping through the lens of intention collapse: the projection from an internal intention space I to an external language space L. We introduce three cheap, model-agnostic metrics computed on a pre-collapse state I: (i) intention entropy H_int(I), (ii) effective dimensionality d_eff(I), and (iii) recoverability Recov(I), operationalized as probe AUROC for predicting eventual success. We evaluate these metrics in a 3x3 study across models (Mistral-7B, LLaMA-3.1-8B, Qwen-2.5-7B) and benchmarks (GSM8K, ARC-Challenge, AQUA-RAT), comparing baseline, chain-of-thought (CoT), and a babble control (n=200 items per cell). CoT increases average accuracy from 34.2% to 47.3% (+13.1 pp), driven by large gains on GSM8K but consistent degradations on ARC-Challenge. Across models, CoT induces distinct entropy regimes relative to baseline, dH = H_int(CoT) - H_int(Base): Mistral shows dH < 0 (lower-entropy CoT), whereas LLaMA shows dH > 0 (higher-entropy CoT), highlighting heterogeneity in CoT-induced internal uncertainty. Finally, probe AUROC is significantly above chance in a subset of settings and can dissociate from behavioral accuracy (e.g., high AUROC alongside lower CoT accuracy on ARC-Challenge for Qwen), suggesting that informative internal signal is not always reliably converted into a final discrete decision under constrained response formats.
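
To make the three metrics named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact definitions, so every choice here is an assumption: intention entropy is taken as the Shannon entropy of a softmax over logits, effective dimensionality as the participation ratio of covariance eigenvalues, and recoverability as a rank-based AUROC over scores from some probe (the probe itself is left out). The function names `intention_entropy`, `effective_dim`, and `auroc` are hypothetical.

```python
import numpy as np


def intention_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits).

    Assumption: entropy of a pre-collapse token distribution is used
    as a stand-in for intention entropy; the source may define it differently.
    """
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max()              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-np.sum(p * np.log(p + 1e-12)))


def effective_dim(states):
    """Participation ratio (sum λ)^2 / sum λ^2 of covariance eigenvalues.

    Assumption: `states` is an (n_samples, n_features) array of hidden
    states; the participation ratio is one common choice, not necessarily
    the one used in the source.
    """
    states = np.asarray(states, dtype=float)
    centered = states - states.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(states) - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # drop tiny negatives
    return float(eig.sum() ** 2 / np.sum(eig ** 2))


def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative.

    Ties count as half. `scores` would come from a probe trained to
    predict eventual success; training that probe is out of scope here.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]     # all positive/negative pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def entropy_shift(logits_cot, logits_base):
    """dH = entropy under CoT minus entropy under baseline."""
    return intention_entropy(logits_cot) - intention_entropy(logits_base)
```

Under these assumptions, a negative value from `entropy_shift` corresponds to the lower-entropy CoT regime the abstract reports for Mistral, and a positive value to the higher-entropy regime reported for LLaMA. An `auroc` of 0.5 corresponds to chance.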