Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks rely primarily on a ``pure eviction'' paradigm, which has two weaknesses: binary token deletion irreversibly discards information, and localized, single-layer scoring yields noisy importance estimates. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that rethinks cache management through two complementary modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking each token's importance trajectory across the Transformer layers and applying order-statistical analysis to identify sustained geometric salience. Building on these robust scores, HCC goes beyond simple eviction with a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold, preserving geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) show that StreamCacheVGGT sets a new state of the art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
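To make the two modules concrete, the following is a minimal sketch of how cross-layer scoring and three-tier compression could fit together. It is an illustration under stated assumptions, not the StreamCacheVGGT implementation: the function names `cross_layer_scores` and `hybrid_compress`, the choice of the `rank`-th smallest per-layer score as the order statistic, the squared Euclidean distance on keys, and the running-mean merge rule are all hypothetical choices that the abstract does not specify.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def cross_layer_scores(layer_scores: Sequence[Sequence[float]], rank: int = 1) -> List[float]:
    """Aggregate per-layer importance into one score per token.

    layer_scores[l][t] is the importance of token t at layer l.
    ASSUMPTION: the order statistic is the rank-th smallest value across
    layers, so a token must score well in most layers to rank highly and
    a single-layer spike is discounted.
    """
    n_tokens = len(layer_scores[0])
    scores = []
    for t in range(n_tokens):
        trajectory = sorted(layer[t] for layer in layer_scores)
        scores.append(trajectory[min(rank, len(trajectory) - 1)])
    return scores


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hybrid_compress(
    keys: Sequence[Vector],
    values: Sequence[Vector],
    scores: Sequence[float],
    n_keep: int,
    n_merge: int,
) -> Tuple[List[Vector], List[Vector]]:
    """Three-tier triage: keep the top tokens, merge the middle tier, evict the rest.

    The output always holds at most n_keep entries, so the cache size stays constant.
    """
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    anchors = order[:n_keep]                    # tier 1: retained as-is
    mergers = order[n_keep:n_keep + n_merge]    # tier 2: folded into anchors
    # tier 3 (everything after n_keep + n_merge) is simply dropped
    if not anchors:
        return [], []

    new_keys = [list(keys[a]) for a in anchors]
    new_values = [list(values[a]) for a in anchors]
    counts = [1] * len(anchors)

    for t in mergers:
        # Nearest anchor in key space, measured against the original anchor keys.
        j = min(range(len(anchors)), key=lambda i: _sq_dist(keys[t], keys[anchors[i]]))
        counts[j] += 1
        w = 1.0 / counts[j]
        # ASSUMPTION: merge by running mean of keys and values.
        new_keys[j] = [k + w * (x - k) for k, x in zip(new_keys[j], keys[t])]
        new_values[j] = [v + w * (x - v) for v, x in zip(new_values[j], values[t])]
    return new_keys, new_values
```

In this sketch, a token that is important in only one layer receives a low aggregated score, and a moderately important token contributes to its nearest anchor instead of being deleted outright. Other order statistics (for example a median) or a score-weighted merge would be equally plausible readings of the description above.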