Current approaches for scaling inference-time compute in transformers train them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited: they cannot be applied during pretraining and rely solely on serially-generated, natural-language verbalization. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Tokens requiring more computation can thus form a "bubble" of cloned residuals in the middle of the network. Crucially, this behavior is learned during pretraining with only the language modeling loss. Using half the training budget, Thoughtbubbles outperforms both standard decoder LMs and non-adaptive parallel-computation baselines in perplexity and zero-shot evaluations, and these results hold across model sizes from 150M to 1.9B parameters. Thoughtbubbles also achieves competitive GSM8K results using half of the baseline's token budget. The implicit nature of our method enables models to begin learning adaptive computation at pretraining time, paving the way toward unified train-time and test-time scaling behavior.
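The fork/delete mechanism can be illustrated with a minimal sketch. The function below is a toy, not the paper's method: the per-stream scores, thresholds (`fork_thresh`, `delete_thresh`), and the hard fork/delete rule are all illustrative assumptions; in Thoughtbubbles the forking behavior is learned implicitly from the language modeling loss rather than thresholded by hand.

```python
import numpy as np

def adaptive_fork_delete(residuals, scores, fork_thresh=0.7, delete_thresh=0.2):
    """Toy sketch of parallel adaptive computation over residual streams.

    residuals: (n, d) array, one row per residual stream at a mid-network layer.
    scores:    (n,) per-stream scores in [0, 1] (hypothetical; stands in for a
               learned signal deciding how much compute each token needs).
    Streams scoring below delete_thresh are dropped; streams scoring above
    fork_thresh are cloned, forming a "bubble" of extra parallel computation.
    """
    out = []
    for vec, s in zip(residuals, scores):
        if s < delete_thresh:
            continue                 # delete: this stream stops consuming compute
        out.append(vec)
        if s > fork_thresh:
            out.append(vec.copy())   # fork: a cloned residual joins the bubble
    d = residuals.shape[1]
    return np.stack(out) if out else np.empty((0, d))

# Three streams: one deleted, one kept as-is, one forked -> 3 streams survive.
res = np.arange(6, dtype=float).reshape(3, 2)
out = adaptive_fork_delete(res, scores=[0.1, 0.5, 0.9])
```

In this example the low-score stream is removed, the mid-score stream passes through unchanged, and the high-score stream is duplicated, so subsequent layers spend proportionally more compute on the token that "needs" it.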