Efficient and accurate feed-forward multi-view reconstruction is a long-standing problem in computer vision. Recent transformer-based models such as VGGT, $\pi^3$, and MapAnything achieve remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which becomes a significant runtime bottleneck on large image sets. In this work, we empirically analyze the global attention matrices of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight, and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than $3\times$ while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach integrates seamlessly into existing global-attention-based architectures such as VGGT, $\pi^3$, and MapAnything, while substantially improving scalability to large image collections.
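To make the idea of a block-sparse replacement for dense global attention concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state how blocks are selected, so the mean-pooled block scoring and the top-`keep` rule used here are assumptions borrowed from common block-sparse attention schemes, and the names `block_sparse_attention`, `block`, and `keep` are hypothetical. The Python loop only illustrates which interactions are computed; the speedup reported above depends on optimized kernels, which this sketch does not attempt to reproduce.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dense_attention(q, k, v):
    """Reference dense attention: every query attends to every key."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def block_sparse_attention(q, k, v, block=4, keep=2):
    """Attend only to the `keep` highest-scoring key blocks per query block.

    Assumption (not from the source): blocks are scored by the dot product
    of mean-pooled queries and mean-pooled keys.
    """
    n, d = q.shape
    assert n % block == 0, "sequence length must be a multiple of the block size"
    nb = n // block
    keep = min(keep, nb)

    # One pooled descriptor per block of queries and per block of keys.
    q_pooled = q.reshape(nb, block, d).mean(axis=1)
    k_pooled = k.reshape(nb, block, d).mean(axis=1)

    # Cheap (nb x nb) estimate of where the attention mass lies.
    block_scores = q_pooled @ k_pooled.T
    top = np.argsort(-block_scores, axis=1)[:, :keep]  # shape (nb, keep)

    k_blocks = k.reshape(nb, block, d)
    v_blocks = v.reshape(nb, block, v.shape[-1])
    out = np.empty((n, v.shape[-1]), dtype=v.dtype)

    for i in range(nb):
        rows = slice(i * block, (i + 1) * block)
        # Gather only the selected key/value blocks for this query block.
        k_sel = k_blocks[top[i]].reshape(-1, d)
        v_sel = v_blocks[top[i]].reshape(-1, v.shape[-1])
        weights = softmax(q[rows] @ k_sel.T / np.sqrt(d))
        out[rows] = weights @ v_sel
    return out
```

When `keep` equals the number of blocks, the sketch reduces to dense attention, which gives a simple sanity check. Smaller values of `keep` trade accuracy for fewer query-key products, and the result stays close to the dense output only if the attention mass really is concentrated in a few blocks, as the analysis above reports.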