Improving the reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine each token beyond a single forward pass. However, we identify a latent overthinking phenomenon: most token predictions are already correct after the first pass, but are sometimes revised into errors in later iterations. We ask whether selectively skipping latent iterations can improve accuracy, and reveal significant potential with an oracle iteration policy that boosts performance by up to 7.3%. Motivated by this, we propose Think-at-Hard (TaH), a looped transformer optimized for selective iteration. TaH employs a lightweight neural decider to trigger latent iteration only at tokens likely to be incorrect after the standard forward pass. During latent iterations, depth-aware Low-Rank Adaptation (LoRA) modules shift the objective from general next-token prediction to focused hard-token refinement. A duo-causal attention mechanism extends attention from the token sequence dimension to an additional iteration depth dimension, enabling cross-iteration information flow with full sequential parallelism. Experiments on nine benchmarks show consistent gains across math, QA, and coding tasks. With identical parameter counts, TaH outperforms always-iterate baselines by 3.8-4.4% while skipping iterations on 93% of tokens, and exceeds single-iteration Qwen3 baselines by 3.0-3.8%. When allowing <3% more parameters from the LoRA modules and the decider, the gains further increase to 5.3-6.2% and 6.1-6.8%, respectively. Our code is available at https://github.com/thu-nics/TaH.
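The latent overthinking effect and the oracle iteration policy described in the abstract can be illustrated with a small toy calculation. The sketch below is not the TaH implementation: the per-token correctness flags are invented for illustration, the helper names (`accuracy`, `apply_policy`) are hypothetical, and it assumes the oracle iterates exactly on the tokens whose first-pass prediction is wrong.

```python
from typing import List, Sequence


def accuracy(correct_flags: Sequence[bool]) -> float:
    """Fraction of tokens predicted correctly."""
    return sum(correct_flags) / len(correct_flags)


def apply_policy(
    first_pass_ok: Sequence[bool],
    second_pass_ok: Sequence[bool],
    iterate: Sequence[bool],
) -> List[bool]:
    """Per-token correctness when token t uses the second-iteration
    prediction if iterate[t] is True, and the first-pass prediction otherwise."""
    return [
        second if it else first
        for first, second, it in zip(first_pass_ok, second_pass_ok, iterate)
    ]


# Invented toy data for 10 tokens (assumption, not taken from the paper).
# Tokens 1 and 8 are "overthought": correct after pass 1, wrong after pass 2.
first_pass_ok = [True, True, True, False, True, True, False, True, True, True]
second_pass_ok = [True, False, True, True, True, True, False, True, False, True]

n = len(first_pass_ok)
never_iterate = [False] * n                       # single forward pass everywhere
always_iterate = [True] * n                       # plain looped transformer
oracle_iterate = [not ok for ok in first_pass_ok]  # iterate only where pass 1 failed

acc_never = accuracy(apply_policy(first_pass_ok, second_pass_ok, never_iterate))
acc_always = accuracy(apply_policy(first_pass_ok, second_pass_ok, always_iterate))
acc_oracle = accuracy(apply_policy(first_pass_ok, second_pass_ok, oracle_iterate))

print(f"never iterate:  {acc_never:.2f}")
print(f"always iterate: {acc_always:.2f}")
print(f"oracle iterate: {acc_oracle:.2f}")
```

On this toy data the oracle policy is at least as accurate as both fixed policies, because it keeps every correct first-pass prediction and only risks a second iteration where the first pass was already wrong. A learned decider, as in TaH, can only approximate this selection, since it has no access to the ground truth at inference time.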