The Visual Geometry Grounded Transformer (VGGT) enables strong feed-forward 3D reconstruction without per-scene optimization. However, its billion-parameter scale creates high memory and compute demands, hindering on-device deployment. Existing LLM quantization methods fail on VGGT because saturated activation channels and diverse 3D semantics make calibration unreliable. Furthermore, VGGT poses hardware challenges: precision-sensitive nonlinear operators and memory-intensive global attention. To address these challenges, we propose VersaQ-3D, an algorithm-architecture co-design framework. Algorithmically, we introduce the first calibration-free, scene-agnostic quantization for VGGT down to 4-bit, leveraging orthogonal transforms to decorrelate features and suppress outliers. Architecturally, we design a reconfigurable accelerator supporting BF16, INT8, and INT4. A unified systolic datapath handles both linear and nonlinear operators, reducing latency by 60%, while two-stage recomputation-based tiling alleviates memory pressure for long-sequence attention. Evaluations show VersaQ-3D preserves 98-99% of full-precision accuracy at W4A8. At W4A4, it outperforms prior methods by 1.61x-2.39x across diverse scenes. The accelerator delivers 5.2x-10.8x speedup over edge GPUs at low power, enabling efficient instant 3D reconstruction.
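The core algorithmic idea (orthogonal transforms that decorrelate features and suppress outliers before low-bit quantization) can be illustrated with a minimal sketch. This is not the paper's implementation; the Hadamard rotation, symmetric per-tensor INT4 scheme, and synthetic outlier data below are illustrative assumptions, but they show why rotating activations before quantization reduces error when a few channels are saturated.

```python
import numpy as np

def hadamard(n):
    # Orthonormal Hadamard matrix via Sylvester construction (n a power of two).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_int4(x):
    # Symmetric per-tensor INT4: round to integer levels in [-8, 7].
    scale = np.abs(x).max() / 7
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale  # dequantized values

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 64))   # toy activations
x[:, 3] *= 50.0                 # one saturated outlier channel

# Direct INT4 quantization: the outlier channel inflates the scale,
# crushing the resolution available to every other channel.
err_direct = np.abs(quantize_int4(x) - x).mean()

# Rotate first: the orthogonal transform spreads the outlier's energy
# across all channels, shrinking the dynamic range per channel.
H = hadamard(64)
err_rotated = np.abs(quantize_int4(x @ H) @ H.T - x).mean()

print(f"direct INT4 error:  {err_direct:.4f}")
print(f"rotated INT4 error: {err_rotated:.4f}")
```

Because the transform is orthogonal, it can be folded into adjacent weight matrices at no runtime cost, and no calibration data is needed to pick it, which is what makes a scene-agnostic, calibration-free scheme plausible.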