Humans can readily look at a single image and recognize the multiple objects in it that they could interact with. We use this skill to plan our interactions with the world and to understand new objects faster, without having to interact with them. In this paper, we aim to endow machines with a similar ability, so that intelligent agents can better explore 3D scenes or manipulate objects. Our approach is a transformer-based model that predicts the 3D location, physical properties, and affordance of objects. To train and validate this model, we collect a dataset of Internet videos, egocentric videos, and indoor images. Our model yields strong performance on our data and generalizes well to robotics data.