Prior work has established test-time training (TTT) as a general framework for further improving a trained model at test time. Before making a prediction on each test instance, the model is trained on that same instance using a self-supervised task, such as image reconstruction with masked autoencoders. We extend TTT to the streaming setting, where multiple test instances (video frames in our case) arrive in temporal order. Our extension is online TTT: the current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before it. Online TTT significantly outperforms the fixed-model baseline on four tasks across three real-world datasets. The relative improvements are 45% and 66% for instance and panoptic segmentation, respectively. Surprisingly, online TTT also outperforms its offline variant, which accesses more information by training on all frames from the entire test video regardless of temporal order. This differs from previous findings using synthetic videos. We conceptualize locality as the advantage of online over offline TTT, and we analyze the role of locality with ablations and a theory based on the bias-variance trade-off.
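The online TTT loop described above can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the authors' implementation: the model is a single scalar parameter, each "frame" is a single number, and a squared reconstruction error stands in for the masked-autoencoder objective. The names `online_ttt`, `sq_grad`, and `identity_predict` are hypothetical and chosen only for this example.

```python
from collections import deque
from typing import Callable, Iterable, List


def sq_grad(theta: float, frame: float) -> float:
    """Gradient of the toy self-supervised loss (theta - frame)**2 w.r.t. theta."""
    return 2.0 * (theta - frame)


def identity_predict(theta: float, frame: float) -> float:
    """Toy 'main task' head: the prediction is simply the current parameter."""
    return theta


def online_ttt(
    frames: Iterable[float],
    theta0: float,
    grad_fn: Callable[[float, float], float],
    predict_fn: Callable[[float, float], float],
    window: int = 4,
    steps: int = 1,
    lr: float = 0.1,
) -> List[float]:
    """Toy online TTT over a stream of frames (assumed setup, scalar model)."""
    theta = theta0                  # carried from frame to frame, never reset
    recent = deque(maxlen=window)   # current frame plus the frames just before it
    outputs = []
    for frame in frames:
        recent.append(frame)
        for _ in range(steps):
            # Average the self-supervised gradient over the sliding window.
            grad = sum(grad_fn(theta, f) for f in recent) / len(recent)
            theta = theta - lr * grad
        # Predict on the current frame only after training on it.
        outputs.append(predict_fn(theta, frame))
    return outputs


if __name__ == "__main__":
    # A stream whose content shifts partway through, as a video scene might.
    stream = [0.0] * 5 + [10.0] * 20
    preds = online_ttt(stream, 0.0, sq_grad, identity_predict,
                       window=3, steps=2, lr=0.25)
    print(preds[-1])
```

In this toy, a fixed model would keep predicting the initial value of 0 for the whole stream, while the online variant tracks the shift because each update sees only recent frames. The `window` argument is the knob that the locality discussion concerns: a larger window averages over more frames (lower variance) but includes frames that are less relevant to the current one (higher bias).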