Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks rely primarily on a ``pure eviction'' paradigm, which destroys information through binary token deletion and suffers from evaluation noise caused by localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that rethinks cache management through two complementary modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking each token's importance trajectory across the Transformer layers and applying order statistics to identify sustained geometric salience. Building on these robust scores, HCC goes beyond simple eviction with a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold, preserving essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state of the art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
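The two modules described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the median stands in for the unspecified order statistic, the three tiers are read as keep, merge, and evict, nearest-neighbor assignment uses squared Euclidean distance between key vectors, and merging is a plain average. The function names and the `n_keep` and `n_merge` parameters are hypothetical.

```python
from statistics import median


def cross_layer_scores(layer_scores):
    """Aggregate per-layer importance into one score per token.

    layer_scores[l][t] is the importance of token t at layer l.
    Assumption: the median across layers is used as the order statistic.
    """
    num_tokens = len(layer_scores[0])
    return [median(layer[t] for layer in layer_scores) for t in range(num_tokens)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def hybrid_compress(keys, values, scores, n_keep, n_merge):
    """Assumed three-tier triage: keep anchors, merge the middle tier, evict the rest."""
    if n_keep < 1:
        raise ValueError("at least one anchor must be retained")

    # Rank token indices from most to least important.
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    anchors = order[:n_keep]
    mergeable = order[n_keep:n_keep + n_merge]
    # Tokens ranked below the merge tier are dropped (evicted).

    # Assign each mid-tier token to its nearest anchor in key space.
    members = {anchor: [anchor] for anchor in anchors}
    for token in mergeable:
        nearest = min(anchors, key=lambda a: squared_distance(keys[token], keys[a]))
        members[nearest].append(token)

    # Assumption: an anchor and its merged tokens are combined by averaging.
    new_keys = [mean_vector([keys[t] for t in members[a]]) for a in anchors]
    new_values = [mean_vector([values[t] for t in members[a]]) for a in anchors]
    return new_keys, new_values
```

In this sketch the cache size after compression always equals `n_keep`, which is what keeps the memory cost constant; the merge tier only changes the content of the retained entries, not their number.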