ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

Single-image-to-3D generative models can now produce high-quality geometry, yet conditioning on a single view inevitably introduces ambiguity about unseen regions. Multi-view conditioning can reduce this ambiguity, but existing methods either require fixed canonical viewpoints or rely on external reconstruction modules that impose heavy training costs and limit generation quality. We observe that pretrained single-view models already possess strong 2D-to-3D grounding that can be reused for multi-view conditioning. However, a closer analysis reveals that their conditioning mechanism entangles orientation control with geometry transfer, two functions that conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements.
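
The token-wise routing idea can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how relevance is scored or how the router is trained, so the code below assumes a hard top-1 assignment based on the best scaled dot-product match between each 3D latent token and the image tokens of each view, with no learned projections. The names `route_tokens` and `routed_cross_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_tokens(latents, views):
    """Assign each latent token to one view (assumed top-1 rule).

    latents: (T, D) array of 3D latent tokens.
    views:   (V, N, D) array of image tokens, N tokens per view.
    Returns an integer array of shape (T,) with the chosen view per token.
    """
    d = latents.shape[-1]
    # scores[v, t, n]: scaled dot product between latent t and token n of view v.
    scores = np.einsum("td,vnd->vtn", latents, views) / np.sqrt(d)
    # Assumption: a view's relevance is its best-matching image token.
    relevance = scores.max(axis=-1)      # (V, T)
    return relevance.argmax(axis=0)      # (T,)

def routed_cross_attention(latents, views):
    """Cross-attend each latent token only to the view it was routed to."""
    d = latents.shape[-1]
    assignment = route_tokens(latents, views)
    out = np.zeros_like(latents)
    for v in range(views.shape[0]):
        idx = np.nonzero(assignment == v)[0]
        if idx.size == 0:
            continue  # no latent token selected this view
        attn = softmax(latents[idx] @ views[v].T / np.sqrt(d))  # (t_v, N)
        out[idx] = attn @ views[v]
    return out, assignment
```

Because each latent token attends to a single view, the attention cost per token in this sketch does not grow with the number of views; only the routing scores do. No camera pose enters the computation, which matches the unposed setting described above, although a trained router would presumably use learned projections rather than raw dot products.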