Generating 3D shapes from single RGB images is essential in applications such as robotics. Current approaches typically target images that contain clear and complete views of the object and do not consider common realistic cases in which the object is largely occluded or truncated. We therefore propose a transformer-based autoregressive model that generates a probability distribution over 3D shapes conditioned on an RGB image containing potentially highly ambiguous observations of the object. To handle realistic scenarios such as occlusion or field-of-view truncation, we create simulated image-to-shape training pairs that enable improved fine-tuning for real-world scenarios. We then adopt cross-attention to effectively identify the most relevant region of interest in the input image for shape generation. This enables sampling shapes with reasonable diversity and strong alignment with the input image. We train and test our model on our synthetic data, then fine-tune and test it on real-world data. Experiments demonstrate that our model outperforms the state of the art in both scenarios.
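As a rough illustration of the cross-attention step, the sketch below lets a sequence of shape tokens query a set of image patch features, so that each token receives a weighted summary of the patches most relevant to it. This is a minimal single-head version in NumPy, not the model described here: the function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all dimensions are assumptions chosen for illustration, and the inputs are random placeholders rather than real features.

```python
import numpy as np


def cross_attention(shape_tokens, image_patches, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative only).

    shape_tokens:  (T, d) array, one row per shape token (the queries).
    image_patches: (P, d) array, one row per image patch (keys and values).
    Returns the attended features (T, d) and the attention weights (T, P).
    """
    q = shape_tokens @ w_q   # queries come from the shape sequence
    k = image_patches @ w_k  # keys come from the image
    v = image_patches @ w_v  # values come from the image

    # Similarity of every shape token to every image patch.
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Numerically stable softmax over the patch axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights


# Hypothetical sizes: 4 shape tokens, 9 image patches, 16-dim features.
rng = np.random.default_rng(0)
d_model = 16
shape_tokens = rng.normal(size=(4, d_model))
image_patches = rng.normal(size=(9, d_model))
w_q, w_k, w_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

attended, weights = cross_attention(shape_tokens, image_patches, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over image patches. In an occluded or truncated image, one would hope that trained weights concentrate on the patches where the object is actually visible, which is the intuition behind using cross-attention to select the region of interest.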