Humans perceive and understand real-world spaces through a stream of visual observations. The ability to continuously maintain and update spatial evidence from potentially unbounded video streams is therefore essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT, a step toward streaming vision-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture that combines large-chunk updates with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to the TTT layers via 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond the architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights so as to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
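To make the test-time-training idea concrete, the sketch below shows the generic fast-weight mechanism the abstract alludes to: a small set of weights is updated by gradient steps on a self-supervised predictive loss, one large chunk of frame features at a time, as the stream arrives. This is a minimal illustration of chunk-wise TTT in general, not the paper's actual architecture; the linear fast-weight layer, the squared-error predictive target, and all names (`ttt_chunk_update`, `W_true`) are assumptions for the toy example.

```python
import numpy as np

def ttt_chunk_update(W, chunk_x, chunk_y, lr=0.1):
    """One large-chunk test-time update of the fast weights W.

    W: (d_in, d_out) fast weights; chunk_x: (n, d_in) frame features;
    chunk_y: (n, d_out) self-supervised targets (e.g. features of the
    next frame). Performs a single gradient step on 0.5 * MSE over
    the whole chunk, rather than per-token updates.
    """
    pred = chunk_x @ W
    grad = chunk_x.T @ (pred - chunk_y) / len(chunk_x)  # dL/dW
    return W - lr * grad

# Toy stream: fast weights adapt chunk by chunk at test time.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(4, 3))   # hypothetical target mapping
W = np.zeros((4, 3))               # fast weights start uninformed
losses = []
for _ in range(50):
    x = rng.normal(size=(16, 4))   # one chunk of incoming features
    y = x @ W_true                 # stand-in spatial-predictive target
    losses.append(float(np.mean((x @ W - y) ** 2)))
    W = ttt_chunk_update(W, x, y, lr=0.3)
print(losses[0] > losses[-1])      # loss shrinks as fast weights adapt
```

The key property illustrated is that the chunk-level gradient step amortizes the update cost over many tokens, which is what makes pairing TTT with sliding-window attention efficient for long videos.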