In embodied intelligence, the embodiment gap between robotic and human hands poses significant challenges for learning from human demonstrations. Although some studies have attempted to bridge this gap using reinforcement learning, they remain confined to reproducing human manipulation, which limits task performance. Moreover, current methods struggle to support diverse robotic hand configurations. In this paper, we propose UniBYD, a unified framework that uses a dynamic reinforcement learning algorithm to discover manipulation policies aligned with the robot's physical characteristics. To enable consistent modeling across diverse robotic hand morphologies, UniBYD incorporates a unified morphological representation (UMR). Building on UMR, we design a dynamic PPO with an annealed reward schedule, enabling reinforcement learning to transition from offline-informed imitation of human demonstrations to online-adaptive exploration of policies better suited to diverse robotic morphologies, thereby going beyond mere imitation of human hands. To address the severe state drift caused by the limited capability of early-stage policies, we design a hybrid Markov-based shadow engine that provides fine-grained guidance to anchor the imitation within the expert's manifold. To evaluate UniBYD, we propose UniManip, the first benchmark for cross-embodiment manipulation spanning diverse robotic morphologies. Experiments demonstrate a 44.08% average improvement in success rate over the current state of the art. Upon acceptance, we will release our code and benchmark.
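To illustrate the idea of an annealed reward schedule, the following is a minimal sketch of how a reward could shift from imitation to task-driven exploration over training. The abstract does not specify the schedule or the reward terms, so everything here is an assumption made for illustration: the cosine shape, the function names `imitation_weight` and `blended_reward`, and the two scalar rewards `r_imitation` and `r_task` are hypothetical and are not taken from UniBYD.

```python
import math


def imitation_weight(step: int, total_steps: int,
                     w_start: float = 1.0, w_end: float = 0.0) -> float:
    """Cosine-anneal the imitation weight from w_start to w_end.

    Assumption: a cosine schedule is used purely for illustration;
    the actual schedule in UniBYD is not given in the abstract.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to [0, 1] so out-of-range steps are safe.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def blended_reward(r_imitation: float, r_task: float,
                   step: int, total_steps: int) -> float:
    """Blend a (hypothetical) imitation reward with a task reward.

    Early in training the imitation term dominates; late in training
    the task term dominates, so the policy may depart from the
    human demonstration.
    """
    w = imitation_weight(step, total_steps)
    return w * r_imitation + (1.0 - w) * r_task
```

Under these assumptions, the reward at the first step equals the imitation reward, the reward at the final step equals the task reward, and intermediate steps interpolate smoothly between the two. Such a blended scalar could be passed to any standard PPO update in place of a fixed reward.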