We propose VIAFormer, a Voxel-Image Alignment Transformer designed for Multi-view Conditioned Voxel Refinement, the task of repairing incomplete, noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts in voxel shapes obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate that VIAFormer serves as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to thrive in the era of large models and big data.