Objects rarely sit in isolation in everyday human environments. If we want robots to operate and perform tasks in these environments, then for all but the simplest tasks they must understand how the objects they manipulate will interact with structural elements of the environment. As such, we would like our robots to reason about how multiple objects and environmental elements relate to one another and how those relations may change as the robot interacts with the world. We examine the problem of predicting inter-object and object-environment relations between previously unseen objects and novel environments purely from partial-view point clouds. Our approach enables robots to plan and execute action sequences that complete multi-object manipulation tasks defined by logical relations. This removes the burden of providing explicit, continuous object states as goals to the robot. We explore several neural network architectures for this task and find the best-performing model to be a novel transformer-based neural network that both predicts object-environment relations and learns a latent-space dynamics function. We achieve reliable sim-to-real transfer without any fine-tuning. Our experiments show that our model understands how changes in observed environmental geometry relate to semantic relations between objects. We show more videos on our website: https://sites.google.com/view/erelationaldynamics.