One of the roadblocks to training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train one specific embodiment on one task, which is expensive and prone to overfitting. This work studies the problem of learning policy representations through heterogeneous pre-training on robot data across different embodiments and tasks at scale. We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a large, shareable trunk of a policy neural network to learn a task- and embodiment-agnostic shared representation. This general architecture aligns the embodiment-specific proprioception and vision inputs into a short sequence of tokens, then processes these tokens to produce control outputs for different tasks. Leveraging recent large-scale multi-embodiment real-world robotic datasets as well as simulation, deployed-robot, and human video datasets, we investigate pre-training policies across heterogeneity. We conduct experiments to investigate the scaling behaviors of training objectives on up to 52 datasets. HPTs outperform several baselines and improve fine-tuned policy performance by over 20% on unseen tasks across multiple simulator benchmarks and real-world settings. See the project website (https://liruiw.github.io/hpt/) for code and videos.
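The modular layout described above (embodiment-specific input alignment, a shared trunk, and task-specific outputs) can be sketched schematically. The following is a minimal illustration in plain NumPy, not the paper's actual implementation: all names, dimensions, and the linear/tanh stand-ins for the Transformer trunk are assumptions chosen only to show how distinct embodiments share one trunk while keeping their own stems and heads.

```python
# Hedged sketch of a stem -> shared trunk -> head policy layout.
# Dimensions and the trunk's computation are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

TOKEN_DIM = 64    # width of the shared token space (assumed)
NUM_TOKENS = 16   # short fixed-length token sequence per embodiment (assumed)

def make_stem(input_dim):
    """Embodiment-specific stem: aligns raw proprioception/vision features
    of size `input_dim` to a short sequence of NUM_TOKENS tokens."""
    W = rng.normal(0.0, 0.02, (input_dim, NUM_TOKENS * TOKEN_DIM))
    return lambda x: (x @ W).reshape(NUM_TOKENS, TOKEN_DIM)

# Shared, embodiment-agnostic trunk (a single tanh layer standing in for
# the large pre-trained Transformer trunk).
TRUNK_W = rng.normal(0.0, 0.02, (TOKEN_DIM, TOKEN_DIM))

def trunk(tokens):
    return np.tanh(tokens @ TRUNK_W)

def make_head(action_dim):
    """Task-specific head: pools trunk tokens and maps to an action vector."""
    W = rng.normal(0.0, 0.02, (TOKEN_DIM, action_dim))
    return lambda h: h.mean(axis=0) @ W

# Two hypothetical embodiments with different input/action dimensionalities
# route through the *same* trunk but keep their own stems and heads.
arm_stem, arm_head = make_stem(input_dim=37), make_head(action_dim=7)
quad_stem, quad_head = make_stem(input_dim=53), make_head(action_dim=12)

arm_action = arm_head(trunk(arm_stem(rng.normal(size=37))))
quad_action = quad_head(trunk(quad_stem(rng.normal(size=53))))
```

In this sketch, only `TRUNK_W` would be shared during heterogeneous pre-training; each new embodiment or task contributes only a small stem and head, which is the design point the architecture description makes.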