Reconstructing a 3D object from a 2D image is a well-studied vision problem to which many deep learning techniques have been applied. 3D convolutional approaches are the most common, although previous work has shown that methods based on 2D convolutions can reach state-of-the-art results while being significantly more efficient to train. Transformers have recently risen to prominence in vision tasks, often outperforming convolutional methods, and there have been some earlier attempts to apply them to 3D object reconstruction. Motivated by this, we replace the convolutions in existing efficient, high-performing 3D object reconstruction techniques with visual transformers, with the aim of achieving superior results on the task. Using a transformer-based encoder and decoder to predict 3D structure from 2D images, we achieve accuracy similar or superior to that of the baseline approach. This study serves as evidence for the potential of visual transformers in 3D object reconstruction.