Human hands possess remarkable dexterity and have long served as a source of inspiration for robotic manipulation. In this work, we propose a human $\textbf{H}$and$\textbf{-In}$formed visual representation learning framework to solve difficult $\textbf{Dex}$terous manipulation tasks ($\textbf{H-InDex}$) with reinforcement learning. Our framework consists of three stages: (i) pre-training representations with 3D human hand pose estimation, (ii) adapting representations offline with self-supervised keypoint detection, and (iii) reinforcement learning with exponential moving average BatchNorm. Together, the last two stages modify only $0.36\%$ of the parameters of the pre-trained representation, which helps preserve the knowledge acquired during pre-training. We evaluate on 12 challenging dexterous manipulation tasks and find that H-InDex substantially outperforms strong baseline methods and recent visual foundation models for motor control. Code is available at https://yanjieze.com/H-InDex.
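
Stage (iii) relies on an exponential moving average over BatchNorm statistics. The abstract does not give the update rule, so the following is only a minimal sketch of what such an update could look like, under stated assumptions: the names `RunningStats`, `batch_stats`, and `ema_update`, the `momentum` value, and the use of the biased batch variance are all illustrative choices, not taken from the H-InDex code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunningStats:
    """Per-channel running mean and variance, as kept by a BatchNorm layer."""
    mean: List[float]
    var: List[float]


def batch_stats(batch: List[List[float]]) -> RunningStats:
    """Per-channel mean and biased variance of a batch.

    `batch` is a list of samples; each sample is a list of channel values.
    Using the biased variance is a simplification made for this sketch.
    """
    n = len(batch)
    channels = len(batch[0])
    mean = [sum(x[c] for x in batch) / n for c in range(channels)]
    var = [sum((x[c] - mean[c]) ** 2 for x in batch) / n for c in range(channels)]
    return RunningStats(mean, var)


def ema_update(
    running: RunningStats,
    batch: List[List[float]],
    momentum: float = 0.01,  # hypothetical value, chosen only for illustration
) -> RunningStats:
    """Blend the statistics of the current batch into the running statistics.

    new = (1 - momentum) * old + momentum * batch_value
    No learnable weights are touched; only the running statistics move.
    """
    current = batch_stats(batch)
    mean = [(1 - momentum) * r + momentum * b
            for r, b in zip(running.mean, current.mean)]
    var = [(1 - momentum) * r + momentum * b
           for r, b in zip(running.var, current.var)]
    return RunningStats(mean, var)
```

With a small `momentum`, the running statistics drift slowly toward those of the observations seen during reinforcement learning while the learned weights stay fixed, which is consistent with the abstract's point that only a small fraction of the pre-trained representation is modified.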