Improving the reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine each token beyond a single forward pass. However, we identify a latent overthinking phenomenon: most token predictions are already correct after the first pass, but some are revised into errors in later iterations. In this work, we ask whether selectively skipping latent iterations can improve accuracy. An oracle iteration policy reveals significant potential, boosting model performance by up to 7.3%. Motivated by this, we propose Think-at-Hard (TaH), a looped transformer optimized for selective iteration. TaH employs a lightweight neural decider to trigger latent iteration only at tokens that are likely incorrect after the standard forward pass. During latent iterations, depth-aware Low-Rank Adaptation (LoRA) modules shift the LLM's objective from general next-token prediction to focused hard-token refinement. A duo-causal attention mechanism extends attention from the token sequence dimension to an additional iteration depth dimension, enabling cross-iteration information flow with full sequential parallelism. Experiments on nine benchmarks show consistent gains across math, QA, and coding tasks. With identical parameter counts, TaH outperforms always-iterate baselines by 3.8-4.4% while skipping iterations on 93% of tokens, and exceeds single-iteration Qwen3 baselines by 3.0-3.8%. When the LoRA and decider modules are allowed to add fewer than 3% more parameters, the gains increase further to 5.3-6.2% and 6.1-6.8%, respectively. Our code is available at https://github.com/thu-nics/TaH.
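The selective-iteration idea described above can be illustrated with a minimal toy sketch. It is not the TaH implementation: the function names (`selective_iterate`, `toy_step`, `toy_decider`), the threshold of 0.5, and the scalar "confidence" states are all assumptions made for illustration. The sketch also processes tokens one at a time and omits the depth-aware LoRA modules and duo-causal attention, whereas the abstract describes cross-iteration information flow with full sequential parallelism.

```python
from typing import Callable, List, Tuple


def selective_iterate(
    states: List[float],
    step: Callable[[float, int], float],
    decider: Callable[[float], float],
    threshold: float = 0.5,  # hypothetical cutoff for "likely incorrect"
    max_depth: int = 2,      # hypothetical cap on passes per token
) -> Tuple[List[float], List[int]]:
    """Run one pass on every token, then extra passes only where the decider fires."""
    outputs: List[float] = []
    depths: List[int] = []
    for state in states:
        depth = 1
        state = step(state, depth)  # standard forward pass (depth 1)
        # Iterate further only while the decider scores the token as hard.
        while depth < max_depth and decider(state) >= threshold:
            depth += 1
            state = step(state, depth)
        outputs.append(state)
        depths.append(depth)
    return outputs, depths


# Toy stand-ins (hypothetical): each "state" is a confidence in [0, 1].
def toy_step(state: float, depth: int) -> float:
    # Each pass raises confidence by 0.3; depth is unused in this toy.
    return min(1.0, state + 0.3)


def toy_decider(state: float) -> float:
    # Low confidence maps to a high "iterate again" score.
    return 1.0 - state


outputs, depths = selective_iterate([0.6, 0.1, 0.5], toy_step, toy_decider)
print(depths)
```

In this toy run only the second token, whose confidence is still low after the first pass, receives a second pass. The other two tokens stop after one pass, loosely mirroring the abstract's point that most tokens can skip further iteration.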