Dynamics models learned from visual observations have been shown to be effective in various robotic manipulation tasks. A key question in learning such dynamics models is which scene representation to use. Prior works typically assume a representation of fixed dimension or resolution, which may be inefficient for simple tasks and ineffective for more complicated ones. In this work, we investigate how to learn dynamic and adaptive representations at different levels of abstraction to achieve the optimal trade-off between efficiency and effectiveness. Specifically, we construct dynamic-resolution particle representations of the environment and learn a unified dynamics model using graph neural networks (GNNs) that allows continuous selection of the abstraction level. At test time, the agent can adaptively determine the optimal resolution at each model-predictive control (MPC) step. We evaluate our method on object pile manipulation, a task commonly encountered in cooking, agriculture, manufacturing, and pharmaceutical applications. Through comprehensive evaluations in both simulation and the real world, we show that our method achieves significantly better performance than state-of-the-art fixed-resolution baselines at gathering, sorting, and redistributing granular object piles composed of various instances such as coffee beans, almonds, and corn.
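The idea of a particle representation whose resolution can be chosen freely can be illustrated with a minimal sketch. The abstract does not say how particles are sampled or how the graph is built, so everything here is an assumption made for illustration: farthest point sampling selects the particles, edges connect particles closer than a fixed radius, and the names `farthest_point_sampling` and `build_edges` are hypothetical rather than taken from the authors' code.

```python
import math
import random


def farthest_point_sampling(points, num_particles, seed=0):
    """Pick up to `num_particles` well-spread points from a non-empty list.

    Assumption: this sampling scheme is chosen for illustration only.
    """
    if num_particles < 1:
        raise ValueError("num_particles must be at least 1")
    num_particles = min(num_particles, len(points))
    first = random.Random(seed).randrange(len(points))
    chosen = [first]
    # Distance from every point to its nearest chosen particle so far.
    dist = [math.dist(p, points[first]) for p in points]
    while len(chosen) < num_particles:
        # The next particle is the point farthest from all chosen ones.
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return [points[i] for i in chosen]


def build_edges(particles, radius):
    """Return directed edges (i, j) between distinct particles closer than `radius`.

    Assumption: a radius-based neighborhood, as a stand-in for GNN graph input.
    """
    return [
        (i, j)
        for i, p in enumerate(particles)
        for j, q in enumerate(particles)
        if i != j and math.dist(p, q) < radius
    ]


if __name__ == "__main__":
    rng = random.Random(1)
    cloud = [(rng.random(), rng.random()) for _ in range(300)]
    # The same observation represented at two abstraction levels.
    for resolution in (10, 50):
        particles = farthest_point_sampling(cloud, resolution)
        edges = build_edges(particles, radius=0.3)
        print(resolution, len(particles), len(edges))
```

In this sketch the resolution is a single integer argument, so a planner could in principle request a coarser or finer graph at every control step; how the actual method selects that value is not described in the abstract.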