Dynamics models learned from visual observations have been shown to be effective in various robotic manipulation tasks. One of the key questions in learning such dynamics models is which scene representation to use. Prior works typically assume a representation at a fixed dimension or resolution, which may be inefficient for simple tasks and ineffective for more complicated ones. In this work, we investigate how to learn dynamic and adaptive representations at different levels of abstraction to achieve the optimal trade-off between efficiency and effectiveness. Specifically, we construct dynamic-resolution particle representations of the environment and learn a unified dynamics model using graph neural networks (GNNs) that allows continuous selection of the abstraction level. At test time, the agent can adaptively determine the optimal resolution at each model-predictive control (MPC) step. We evaluate our method on object pile manipulation, a task commonly encountered in cooking, agriculture, manufacturing, and pharmaceutical applications. Through comprehensive evaluations both in simulation and in the real world, we show that our method achieves significantly better performance than state-of-the-art fixed-resolution baselines at gathering, sorting, and redistributing granular object piles made of various instances such as coffee beans, almonds, and corn.
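The abstract does not say how the particles are sampled or how the resolution is scored, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes farthest point sampling (FPS) to build a particle set of any requested size from an observed 2D point cloud, and it replaces the learned GNN rollout cost with a hypothetical proxy: the mean distance from each observed point to its nearest particle (representation error) plus a per-particle compute penalty weighted by a hypothetical `compute_weight`. The names `farthest_point_sampling`, `coverage_error`, and `select_resolution` are illustrative and not taken from the authors' code.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def farthest_point_sampling(points: Sequence[Point], k: int, seed: int = 0) -> List[Point]:
    """Greedily pick k well-spread points as particles (FPS; an assumed sampler)."""
    if k <= 0 or not points:
        return []
    k = min(k, len(points))
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    chosen = [first]
    # dist[i] = distance from point i to its nearest already-chosen particle.
    dist = [math.dist(p, points[first]) for p in points]
    while len(chosen) < k:
        # Next particle: the point farthest from all particles chosen so far.
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[nxt]))
    return [points[i] for i in chosen]


def coverage_error(points: Sequence[Point], particles: Sequence[Point]) -> float:
    """Hypothetical fidelity proxy: mean distance from each point to its nearest particle."""
    return sum(min(math.dist(p, q) for q in particles) for p in points) / len(points)


def select_resolution(
    points: Sequence[Point],
    candidates: Sequence[int],
    compute_weight: float,
    cost_fn: Callable[[Sequence[Point], Sequence[Point]], float] = coverage_error,
) -> Tuple[int, float]:
    """Pick the particle count minimizing cost_fn + compute_weight * k.

    cost_fn stands in (hypothetically) for the task cost predicted by a learned
    dynamics model at that resolution; here it defaults to coverage_error.
    """
    best_k, best_score = -1, math.inf
    for k in candidates:
        particles = farthest_point_sampling(points, k)
        score = cost_fn(points, particles) + compute_weight * k
        if score < best_score:
            best_k, best_score = k, score
    return best_k, best_score


if __name__ == "__main__":
    # A toy "pile": a 5 x 4 grid of object positions.
    pile = [(float(i), float(j)) for i in range(5) for j in range(4)]
    for w in (0.0, 0.01, 1.0):
        k, score = select_resolution(pile, [4, 8, 20], compute_weight=w)
        print(f"compute_weight={w}: chose {k} particles (score {score:.3f})")
```

In an MPC loop, `select_resolution` would be called once per control step on the current observation, so the particle count can shrink when a coarse representation suffices and grow when finer detail matters. A larger `compute_weight` biases the choice toward fewer particles; with `compute_weight=0` the proxy always favors the finest candidate.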