Designing generalizable agents capable of adapting to diverse embodiments has received significant attention in Reinforcement Learning (RL), which is critical for deploying RL agents in various real-world applications. Previous Cross-Embodiment RL approaches have focused on transferring knowledge across embodiments within specific tasks. These methods often result in knowledge tightly coupled with those tasks and fail to adequately capture the distinct characteristics of different embodiments. To address this limitation, we introduce the notion of Cross-Embodiment Unsupervised RL (CEURL), which leverages unsupervised learning to enable agents to acquire embodiment-aware and task-agnostic knowledge through online interactions within reward-free environments. We formulate CEURL as a novel Controlled Embodiment Markov Decision Process (CE-MDP) and systematically analyze CEURL's pre-training objectives under CE-MDP. Based on these analyses, we develop a novel algorithm, Pre-trained Embodiment-Aware Control (PEAC), for handling CEURL, incorporating an intrinsic reward function specifically designed for cross-embodiment pre-training. PEAC not only provides an intuitive optimization strategy for cross-embodiment pre-training but also integrates flexibly with existing unsupervised RL methods, facilitating cross-embodiment exploration and skill discovery. Extensive experiments in both simulated (e.g., DMC and Robosuite) and real-world environments (e.g., legged locomotion) show that PEAC significantly improves adaptation performance and cross-embodiment generalization, demonstrating its effectiveness in overcoming the unique challenges of CEURL. The project page and code are available at https://yingchengyang.github.io/ceurl.