Current 3D reconstruction techniques struggle to faithfully infer unbounded scenes from only a few images. Specifically, existing methods have high computational demands, require detailed pose information, and cannot reliably reconstruct occluded regions. We introduce 6Img-to-3D, an efficient, scalable transformer-based encoder-renderer method for single-shot image-to-3D reconstruction. Our method outputs a 3D-consistent parameterized triplane from only six outward-facing input images for large-scale, unbounded outdoor driving scenarios. We take a step towards resolving existing shortcomings by combining contracted custom cross- and self-attention mechanisms for triplane parameterization, differentiable volume rendering, scene contraction, and image feature projection. We show that six surround-view vehicle images from a single timestamp, without global pose information, are enough to reconstruct 360$^{\circ}$ scenes at inference time, which takes 395 ms. Our method allows, for example, rendering third-person images and bird's-eye views. Our code is available at https://github.com/continental/6Img-to-3D, and more examples can be found on our website at https://6Img-to-3D.GitHub.io/.
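To make the roles of scene contraction and the triplane representation more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a radial contraction of the kind commonly used for unbounded scenes (points inside the unit ball are left unchanged, and everything farther away is squashed into a ball of radius 2), a nearest-cell lookup instead of the bilinear interpolation a real system would likely use, and summation as the way to combine the three plane features. The function names `contract`, `to_index`, and `sample_triplane`, the plane keys, and the grid layout are all hypothetical.

```python
import math


def contract(point):
    """Map an unbounded 3D point into a ball of radius 2.

    Points with norm <= 1 are returned unchanged; farther points are
    squashed radially so that infinity maps onto the radius-2 sphere.
    (Assumed contraction; the method's exact variant may differ.)
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= 1.0:
        return tuple(point)
    scale = (2.0 - 1.0 / norm) / norm
    return tuple(c * scale for c in point)


def to_index(coord, resolution):
    """Convert a contracted coordinate in [-2, 2] to a grid cell index."""
    u = (coord + 2.0) / 4.0  # normalize to [0, 1]
    return min(resolution - 1, max(0, int(u * resolution)))


def sample_triplane(planes, point):
    """Nearest-cell lookup on three axis-aligned feature planes.

    `planes` maps "xy", "xz", "yz" to square grids of feature vectors.
    The three features are summed (an assumption; concatenation is
    another common choice).
    """
    x, y, z = contract(point)
    res = len(planes["xy"])
    i, j, k = (to_index(c, res) for c in (x, y, z))
    f_xy = planes["xy"][i][j]
    f_xz = planes["xz"][i][k]
    f_yz = planes["yz"][j][k]
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


# Toy triplane: 8x8 cells per plane, 2 feature channels, all ones.
res, channels = 8, 2
planes = {
    name: [[[1.0] * channels for _ in range(res)] for _ in range(res)]
    for name in ("xy", "xz", "yz")
}

far_point = (100.0, 0.0, 0.0)      # 100 units away from the vehicle
contracted = contract(far_point)   # lands just inside the radius-2 ball
features = sample_triplane(planes, far_point)
```

In a full pipeline, features sampled this way along each camera ray would be decoded into density and color and composited by differentiable volume rendering; the contraction is what lets a fixed-size triplane cover an unbounded driving scene.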