We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods that are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model end-to-end on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data makes our model highly generalizable, allowing it to produce high-quality 3D reconstructions from a variety of test inputs, including real-world in-the-wild captures and images created by generative models. Video demos and interactive 3D meshes can be found on our LRM project webpage: https://yiconghong.me/LRM.