Human hands possess remarkable dexterity and have long served as a source of inspiration for robotic manipulation. In this work, we propose a human $\textbf{H}$and$\textbf{-In}$formed visual representation learning framework to solve difficult $\textbf{Dex}$terous manipulation tasks ($\textbf{H-InDex}$) with reinforcement learning. Our framework consists of three stages: (i) pre-training representations with 3D human hand pose estimation, (ii) adapting the representations offline with self-supervised keypoint detection, and (iii) reinforcement learning with exponential moving average BatchNorm. Together, the last two stages modify only $0.36\%$ of the parameters of the pre-trained representation, so that the knowledge from pre-training is preserved as fully as possible. We empirically study 12 challenging dexterous manipulation tasks and find that H-InDex substantially outperforms strong baseline methods and recent visual foundation models for motor control. Code is available at https://yanjieze.com/H-InDex.
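To make stage (iii) more concrete, the following is a minimal sketch of how BatchNorm statistics can be tracked with an exponential moving average. It is an illustration under stated assumptions, not the H-InDex implementation: the class name `EmaBatchNorm1d`, the helper `ema_update`, the default `momentum`, and the restriction to a single scalar feature are all hypothetical choices made for brevity, and the abstract does not specify how the update is scheduled during reinforcement learning.

```python
from statistics import fmean, pvariance


def ema_update(running, batch_value, momentum=0.1):
    """Blend a running statistic with the statistic of a new batch."""
    return (1.0 - momentum) * running + momentum * batch_value


class EmaBatchNorm1d:
    """Toy single-feature BatchNorm (hypothetical, for illustration only).

    The running mean and variance follow an exponential moving average of
    the batch statistics; no learnable weights are involved.
    """

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum   # assumed value, not taken from the paper
        self.eps = eps             # small constant for numerical stability
        self.running_mean = 0.0
        self.running_var = 1.0

    def update(self, batch):
        """Move the running statistics toward those of `batch`."""
        self.running_mean = ema_update(
            self.running_mean, fmean(batch), self.momentum
        )
        self.running_var = ema_update(
            self.running_var, pvariance(batch), self.momentum
        )

    def normalize(self, batch):
        """Normalize `batch` using the running (not per-batch) statistics."""
        scale = (self.running_var + self.eps) ** 0.5
        return [(x - self.running_mean) / scale for x in batch]


if __name__ == "__main__":
    bn = EmaBatchNorm1d(momentum=0.5)
    bn.update([2.0, 4.0])
    print(bn.running_mean, bn.running_var)
    print(bn.normalize([1.5, 2.5]))
```

Because only the running statistics change in such an update, the learned weights of the representation stay untouched, which is in the spirit of the small parameter footprint reported above.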