Learning general-purpose models from diverse datasets has achieved great success in machine learning. In robotics, however, existing multi-task learning methods are typically constrained to a single robot and workspace, and recent work such as RT-X requires a non-trivial action-normalization procedure to manually bridge the gap between the different action spaces of diverse environments. In this paper, we propose the visual kinematic chain as a precise and universal representation of quasi-static actions for robot learning across diverse environments; it requires no manual adjustment, since visual kinematic chains can be obtained automatically from the robot's model and the camera parameters. We propose the Visual Kinematics Transformer (VKT), a convolution-free architecture that supports an arbitrary number of camera viewpoints and is trained with a single objective: forecasting kinematic structures through optimal point-set matching. We demonstrate the superior performance of VKT over BC transformers as a general agent on Calvin, RLBench, Open-X, and real robot manipulation tasks. Video demonstrations can be found at https://mlzxy.github.io/visual-kinetic-chain.
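To illustrate the "optimal point-set matching" objective mentioned above, the sketch below computes a matching loss between a predicted and a ground-truth set of 2D keypoints by searching for the one-to-one assignment with minimal total distance. This is a minimal, assumption-laden illustration (brute force over permutations, suitable only for small point sets), not the paper's actual implementation; the function name and 2D setting are hypothetical.

```python
import itertools
import math

def pointset_matching_loss(pred, target):
    """Illustrative optimal point-set matching loss (hypothetical helper).

    Finds the one-to-one assignment between predicted and ground-truth
    points that minimizes the total Euclidean distance, then returns the
    average distance under that best assignment. Brute force over all
    permutations, so only practical for small point sets.
    """
    assert len(pred) == len(target), "point sets must be the same size"
    best = math.inf
    for perm in itertools.permutations(range(len(target))):
        cost = sum(math.dist(pred[i], target[j]) for i, j in enumerate(perm))
        best = min(best, cost)
    return best / len(pred)
```

Because the loss minimizes over assignments, it is invariant to the ordering of the predicted points; a scalable implementation would typically use the Hungarian algorithm (e.g., `scipy.optimize.linear_sum_assignment`) instead of permutation search.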