We propose Confidence-Guided Token Merging (Co-Me), a mechanism that accelerates visual geometric transformers without retraining or finetuning the base model. Co-Me distills a lightweight confidence predictor that ranks tokens by uncertainty, then selectively merges the low-confidence ones, reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates the regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to a variety of multi-view and streaming visual geometric transformers, with speedups that scale with sequence length. Applied to VGGT and Pi3, Co-Me achieves speedups of up to 21.5x and 20.4x, respectively, making visual geometric transformers practical for real-time 3D perception and reconstruction.
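To make the ranking-and-merging idea concrete, the following is a minimal NumPy sketch and not the Co-Me implementation. It assumes that tokens are arranged in fixed-size contiguous groups, that a per-token confidence score is already available (in Co-Me it would come from the distilled predictor, which is not modeled here), and that merging is a plain average. The function name `merge_low_confidence` and the parameters `group_size` and `merge_ratio` are hypothetical.

```python
import numpy as np


def merge_low_confidence(tokens, confidence, group_size=4, merge_ratio=0.5):
    """Illustrative sketch: average the token groups with the lowest confidence.

    tokens:      (N, D) array of token features.
    confidence:  (N,) array of per-token confidence scores (assumed given).
    group_size:  number of contiguous tokens per group (hypothetical parameter).
    merge_ratio: fraction of groups to merge (hypothetical parameter).
    """
    tokens = np.asarray(tokens, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    n, d = tokens.shape
    if n % group_size != 0:
        raise ValueError("number of tokens must be divisible by group_size")

    num_groups = n // group_size
    grouped = tokens.reshape(num_groups, group_size, d)
    group_conf = confidence.reshape(num_groups, group_size).mean(axis=1)

    # Rank groups by mean confidence; the lowest-ranked fraction is merged.
    num_merge = int(round(merge_ratio * num_groups))
    order = np.argsort(group_conf, kind="stable")
    merge_mask = np.zeros(num_groups, dtype=bool)
    merge_mask[order[:num_merge]] = True

    # A merged group contributes one averaged token, so every region is
    # still represented; a confident group keeps all of its tokens.
    pieces = []
    for g in range(num_groups):
        if merge_mask[g]:
            pieces.append(grouped[g].mean(axis=0, keepdims=True))
        else:
            pieces.append(grouped[g])
    return np.concatenate(pieces, axis=0), merge_mask
```

With 8 tokens, `group_size=2`, and `merge_ratio=0.5`, two of the four groups are collapsed to a single token each, so the sequence shrinks from 8 to 6 tokens. Because attention cost grows with sequence length, a shorter sequence is where the savings would come from in this simplified picture.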