Generating 3D shapes from single RGB images is essential in various applications such as robotics. Current approaches typically target images containing clear and complete visual descriptions of the object, without considering the common realistic cases where the object is largely occluded or truncated. We thus propose a transformer-based autoregressive model to generate the probabilistic distribution of 3D shapes conditioned on an RGB image containing potentially highly ambiguous observations of the object. To handle realistic scenarios such as occlusion or field-of-view truncation, we create simulated image-to-shape training pairs that enable improved fine-tuning for real-world scenarios. We then adopt cross-attention to effectively identify the most relevant region of interest in the input image for shape generation. This enables inference of sampled shapes with reasonable diversity and strong alignment with the input image. We train and test our model on our synthetic data, then fine-tune and test it on real-world data. Experiments demonstrate that our model outperforms the state of the art in both scenarios.
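The cross-attention conditioning described above can be sketched in its simplest form: shape tokens act as queries that attend over image-patch features (keys/values), so each generated token draws on the most relevant image region. This is a minimal, dependency-free illustration of scaled dot-product cross-attention, not the paper's actual implementation; all function names and dimensions here are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, illustrative).

    queries: shape-token features, list of d-dim vectors
    keys/values: image-patch features, list of d-dim vectors
    Returns one attended d-dim vector per query token.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this shape token to every image patch.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)  # attention weights over image patches
        # Weighted sum of patch values -> conditioning vector.
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(d)])
    return out
```

For example, a query aligned with the first key receives nearly all of the attention mass, so its output is dominated by the first value vector; in the full model this is how a shape token "selects" the visible, unoccluded region of the object.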