Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. Although causal variants of VGGT address this challenge through a key-value (KV) cache mechanism, the cache grows linearly with the stream length, creating a major memory bottleneck. Under limited memory budgets, early cache eviction significantly degrades reconstruction quality and temporal consistency. In this work, we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity. Based on this insight, we propose STAC, a Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. STAC consists of three key components: (1) a Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; (2) a Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and (3) a Chunk-based Multi-frame Optimization strategy that jointly processes consecutive frames to improve temporal coherence and GPU efficiency. Extensive experiments show that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving the scalability of real-time 3D reconstruction in streaming settings.
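To make the first two components more concrete, the following is a minimal sketch of how decayed cumulative attention scores could drive cache eviction, and how spatially redundant tokens could be pooled per voxel. It is an illustration under stated assumptions, not the STAC implementation: the abstract does not give the update rule, so the exponential `decay` factor, the fixed `budget`, the mean pooling, and the function names `update_scores`, `select_tokens_to_keep`, and `voxel_merge` are all hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def update_scores(scores: List[float], attn: Sequence[float],
                  decay: float = 0.9) -> List[float]:
    """Decay the running score of each cached token, then add the attention
    it received from the newest frame. `attn` has one entry per cached token;
    tokens that just entered the cache start from a score of zero.
    (Assumed update rule: score <- decay * score + attention.)"""
    padded = list(scores) + [0.0] * (len(attn) - len(scores))
    return [decay * s + a for s, a in zip(padded, attn)]


def select_tokens_to_keep(scores: Sequence[float], budget: int) -> List[int]:
    """Return the indices of the `budget` highest-scoring tokens, in their
    original cache order. Everything else would be evicted."""
    if len(scores) <= budget:
        return list(range(len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def voxel_merge(positions: Sequence[Tuple[float, float, float]],
                features: Sequence[Sequence[float]],
                voxel_size: float) -> Dict[Tuple[int, int, int], List[float]]:
    """Group tokens by the voxel containing their 3D position and average
    the features inside each voxel (assumed pooling: arithmetic mean)."""
    buckets = defaultdict(list)
    for pos, feat in zip(positions, features):
        key = tuple(math.floor(c / voxel_size) for c in pos)
        buckets[key].append(feat)
    merged = {}
    for key, feats in buckets.items():
        dim = len(feats[0])
        merged[key] = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    return merged


# Usage: two cached tokens, then a third arrives with the next frame.
scores = update_scores([], [0.5, 0.5])
scores = update_scores(scores, [0.1, 0.8, 0.1], decay=0.5)
keep = select_tokens_to_keep(scores, budget=2)

# Usage: the first two tokens fall in the same unit voxel and are averaged.
merged = voxel_merge(
    positions=[(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (1.5, 0.0, 0.0)],
    features=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    voxel_size=1.0,
)
```

With a fixed budget the working cache stays bounded regardless of stream length, and the decay term lets tokens that were once heavily attended fade if later frames stop using them. The voxel pooling trades per-token detail for a storage cost that scales with the occupied volume rather than with the number of frames.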