Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT) -- temporarily updating model parameters during inference using a loss derived from input data -- as a mechanism for improving LMs' reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to $6\times$ higher accuracy compared to fine-tuned baselines -- reaching $53.0\%$ on the public validation set with an 8B-parameter LM and $61.9\%$ when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the $10$-shot setting by $7.3$ percentage points ($50.5\%$ to $57.8\%$). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.