In-Place Test-Time Training

The static "train then deploy" paradigm fundamentally prevents Large Language Models (LLMs) from dynamically adapting their weights to the continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by three critical barriers: architectural incompatibility, computational inefficiency, and fast-weight objectives that are misaligned with language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with test-time training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically grounded objective explicitly aligned with the next-token-prediction task that governs autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, yields a highly scalable algorithm compatible with context parallelism. Extensive experiments validate the framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation studies provide further insight into our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
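
To make the mechanism in the abstract concrete, the following is a minimal, dependency-free sketch of a chunk-wise fast-weight update in which only the final projection of a toy MLP block changes at inference time. It is an illustration under stated assumptions, not the paper's implementation: the class `FastWeightMLP`, the function `run_stream`, the ReLU activation, the learning rate, and above all the squared-error target (the next input vector) are hypothetical stand-ins. The abstract does not specify the actual next-token-aligned objective, so a placeholder is used in its place.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class FastWeightMLP:
    """Toy MLP block: W_up stays frozen, W_down (final projection) is the fast weight."""

    def __init__(self, d_model, d_hidden, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.W_up = [[rng.gauss(0, 0.3) for _ in range(d_model)] for _ in range(d_hidden)]
        self.W_down = [[rng.gauss(0, 0.3) for _ in range(d_hidden)] for _ in range(d_model)]
        self.lr = lr  # hypothetical step size for the test-time update

    def hidden(self, x):
        return [max(0.0, a) for a in matvec(self.W_up, x)]  # ReLU activation

    def forward(self, x):
        return matvec(self.W_down, self.hidden(x))

    def chunk_loss(self, xs, targets):
        """Mean over tokens of 0.5 * squared error (placeholder objective)."""
        total = 0.0
        for x, t in zip(xs, targets):
            y = self.forward(x)
            total += 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))
        return total / len(xs)

    def update(self, xs, targets):
        """One gradient step on chunk_loss, applied in place to W_down only."""
        grad = [[0.0] * len(row) for row in self.W_down]
        for x, t in zip(xs, targets):
            h = self.hidden(x)
            y = matvec(self.W_down, h)
            for i, (yi, ti) in enumerate(zip(y, t)):
                for j, hj in enumerate(h):
                    grad[i][j] += (yi - ti) * hj / len(xs)
        for i, row in enumerate(self.W_down):
            for j in range(len(row)):
                row[j] -= self.lr * grad[i][j]


def run_stream(block, tokens, chunk_size):
    """Process tokens chunk by chunk: predict with current fast weights, then update."""
    outputs = []
    # The last token has no successor, so it serves only as a target.
    for start in range(0, len(tokens) - 1, chunk_size):
        end = min(start + chunk_size, len(tokens) - 1)
        xs = tokens[start:end]
        targets = tokens[start + 1:end + 1]  # ASSUMPTION: next input vector as target
        outputs.extend(block.forward(x) for x in xs)
        block.update(xs, targets)  # later chunks see the adapted W_down
    return outputs


# Illustrative run on random vectors standing in for token representations.
rng = random.Random(1)
tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(17)]
block = FastWeightMLP(d_model=4, d_hidden=8)
outputs = run_stream(block, tokens, chunk_size=4)
```

Because each chunk is predicted before its update is applied, an update only influences later chunks, which is one plausible reading of why a chunk-wise scheme can be processed efficiently. The real method's objective and update rule may differ substantially from this sketch.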
