Reinforcement learning requires meticulous, expert-designed reward shaping to elicit target behaviors, while imitation learning relies on costly task-specific data. In contrast, unsupervised skill discovery can reduce these burdens by learning a diverse repertoire of useful skills driven by intrinsic motivation. However, existing methods exhibit two key limitations. First, they typically rely on a single policy to master a versatile repertoire of behaviors without modeling the shared structure or the distinctions among those behaviors, which lowers learning efficiency. Second, they are susceptible to reward hacking, where the reward signal rises and converges rapidly while the learned skills show little actual diversity. In this work, we introduce an Orthogonal Mixture-of-Experts (OMoE) architecture that prevents diverse behaviors from collapsing into overlapping representations, enabling a single policy to master a wide spectrum of locomotion skills. In addition, we design a multi-discriminator framework in which each discriminator operates on a distinct observation space, effectively mitigating reward hacking. We evaluate our method on the 12-DOF Unitree A1 quadruped robot, demonstrating a diverse set of locomotion skills. Experiments show that the proposed framework improves training efficiency and yields an 18.3\% expansion in state-space coverage over the baseline.
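To make the orthogonality idea concrete, the sketch below shows one common way to penalize overlap between per-expert representations: summing the squared pairwise cosine similarities of the expert feature vectors, which is zero exactly when the representations are mutually orthogonal. This is a minimal illustrative form, not the paper's actual loss; the function name and the (k, d) feature layout are assumptions.

```python
import numpy as np

def orthogonality_penalty(expert_feats: np.ndarray) -> float:
    """Hypothetical overlap penalty for a mixture-of-experts layer.

    expert_feats: (k, d) array, one row per expert. Returns the sum of
    squared pairwise cosine similarities between distinct experts, i.e.
    the squared Frobenius norm of the off-diagonal of the Gram matrix
    of the L2-normalized rows. Zero iff the experts are orthogonal.
    """
    norms = np.linalg.norm(expert_feats, axis=1, keepdims=True)
    unit = expert_feats / np.clip(norms, 1e-8, None)
    gram = unit @ unit.T                       # (k, k) cosine similarities
    off_diag = gram - np.diag(np.diag(gram))   # zero out the diagonal
    return float(np.sum(off_diag ** 2))

# Orthogonal experts incur zero penalty...
print(orthogonality_penalty(np.eye(3)))       # 0.0
# ...while identical (collapsed) experts are maximally penalized:
# all 3*2 = 6 off-diagonal cosines equal 1.
print(orthogonality_penalty(np.ones((3, 4)))) # 6.0
```

Driving this penalty toward zero during training discourages the experts from collapsing onto the same representation, which is the failure mode the OMoE architecture is designed to prevent.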