Learning Versatile Humanoid Manipulation with Touch Dreaming

Humanoid robots promise general-purpose assistance, yet real-world humanoid loco-manipulation remains challenging because it requires whole-body stability, dexterous hands, and contact-aware perception under frequent contact changes. In this work, we study dexterous, contact-rich humanoid loco-manipulation. We first develop an RL-based whole-body controller that provides stable lower-body and torso execution during complex manipulation. Built on this controller, we develop a whole-body humanoid data collection system that combines VR-based teleoperation with human-to-humanoid motion mapping, enabling efficient collection of real-world demonstrations. We then propose Humanoid Transformer with Touch Dreaming (HTD), a multimodal encoder-decoder Transformer that models touch as a core modality alongside multi-view vision and proprioception. HTD is trained in a single stage with behavioral cloning augmented by touch dreaming: in addition to predicting action chunks, the policy predicts future hand-joint forces and future tactile latents, encouraging the shared Transformer trunk to learn contact-aware representations for dexterous interaction. Across five contact-rich tasks (Insert-T, Book Organization, Towel Folding, Cat Litter Scooping, and Tea Serving), HTD achieves a 90.9% relative improvement in average success rate over the stronger baseline. Ablation results further show that latent-space tactile prediction is more effective than raw tactile prediction, yielding a 30% relative gain in success rate. These results demonstrate that combining robust whole-body execution, scalable humanoid data collection, and predictive touch-centered learning enables versatile, high-dexterity humanoid manipulation in the real world. Project webpage: humanoid-touch-dream.github.io.
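The single-stage objective described above, behavioral cloning on action chunks plus auxiliary prediction of future hand-joint forces and future tactile latents, can be illustrated with a minimal sketch. The abstract does not specify the loss form or weighting, so everything here is an assumption made for illustration: the use of mean squared error for all three terms, the weights `w_force` and `w_tactile`, the dictionary keys, and the function names are hypothetical and are not taken from the HTD implementation.

```python
from typing import Dict, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def touch_dreaming_loss(
    pred: Dict[str, Sequence[float]],
    target: Dict[str, Sequence[float]],
    w_force: float = 1.0,    # hypothetical weight on the force-prediction term
    w_tactile: float = 1.0,  # hypothetical weight on the tactile-latent term
) -> float:
    """Weighted sum of one imitation term and two auxiliary prediction terms.

    Assumed (not from the source): MSE for every term and a linear weighting.
    """
    # Behavioral cloning term: predicted action chunk vs. demonstrated actions.
    action_term = mse(pred["action_chunk"], target["action_chunk"])
    # Touch dreaming, part 1: future hand-joint forces.
    force_term = mse(pred["future_force"], target["future_force"])
    # Touch dreaming, part 2: future tactile latents (latent space, not raw signals).
    tactile_term = mse(pred["future_tactile_latent"], target["future_tactile_latent"])
    return action_term + w_force * force_term + w_tactile * tactile_term


if __name__ == "__main__":
    # Toy flattened vectors; real inputs would be batched tensors.
    pred = {
        "action_chunk": [1.0, 2.0],
        "future_force": [0.5],
        "future_tactile_latent": [1.0, 1.0],
    }
    target = {
        "action_chunk": [1.0, 0.0],
        "future_force": [0.0],
        "future_tactile_latent": [0.0, 0.0],
    }
    print(touch_dreaming_loss(pred, target))
```

Because all three terms are computed from outputs of the same shared trunk, gradients from the force and tactile terms shape the representation the action head also uses, which is the mechanism the abstract credits for contact-aware features.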
