This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Unsupervised skill discovery aims to learn diverse and exploratory skills without extrinsic reward, so that the discovered skills can be adapted efficiently to multiple downstream tasks in various ways. However, recent methods struggle to balance behavioral exploration and diversity, particularly when the agent dynamics are complex and potential skills are hard to discern (e.g., robot behavior discovery). In this paper, we propose \textbf{Co}ntrastive \textbf{m}ulti-objective \textbf{S}kill \textbf{D}iscovery \textbf{(ComSD)}, which discovers exploratory and diverse behaviors through a novel intrinsic incentive, the contrastive multi-objective reward. This reward combines a novel diversity reward based on contrastive learning, which drives agents to distinguish existing skills, with a particle-based exploration reward, which drives them to reach and learn new behaviors. Moreover, we propose a novel dynamic weighting mechanism between the two rewards to balance diversity and exploration, which further improves behavioral quality. Extensive experiments and analysis demonstrate that ComSD can generate diverse behaviors at different exploratory levels for complex multi-joint robots, achieving state-of-the-art performance across 32 challenging downstream adaptation tasks, a result that recent advanced methods do not match. Code will be released after publication.
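To make the structure of the reward described in the abstract more concrete, the following is a minimal sketch of how a contrastive diversity term and a particle-based exploration term might be combined under a dynamic weight. It is not the authors' implementation: the abstract does not specify the formulas, so the InfoNCE-style score, the k-nearest-neighbor distance estimate, the linear weighting rule, and all names and parameters (`knn_exploration_reward`, `contrastive_diversity_reward`, `combined_reward`, `k`, `temperature`, `w_min`, `w_max`) are assumptions chosen for illustration.

```python
import numpy as np


def knn_exploration_reward(z, k=3):
    """Particle-based reward: log(1 + distance to the k-th nearest neighbor).

    Assumption: a common k-NN entropy-style estimate, not taken from the paper.
    Requires k < number of rows in z.
    """
    # Pairwise Euclidean distances between all embeddings in the batch.
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    # After sorting, column 0 is the self-distance (0), so column k is the
    # distance to the k-th nearest neighbor.
    kth = np.sort(dists, axis=1)[:, k]
    return np.log1p(kth)


def contrastive_diversity_reward(state_emb, skill_emb, temperature=0.5):
    """InfoNCE-style score: log-probability that state i matches skill i.

    Assumption: row i of state_emb was produced under the skill in row i of
    skill_emb; the other rows of the batch act as negatives.
    """
    s = state_emb / np.linalg.norm(state_emb, axis=1, keepdims=True)
    w = skill_emb / np.linalg.norm(skill_emb, axis=1, keepdims=True)
    logits = s @ w.T / temperature
    # Numerically stable log-sum-exp over each row.
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return np.diag(logits) - lse


def combined_reward(r_div, r_exp, w_min=0.1, w_max=1.0):
    """Hypothetical dynamic weighting: the exploration weight grows linearly
    with the batch-normalized diversity reward. A placeholder rule only."""
    span = r_div.max() - r_div.min()
    norm = (r_div - r_div.min()) / span if span > 0 else np.zeros_like(r_div)
    weight = w_min + (w_max - w_min) * norm
    return r_div + weight * r_exp
```

In this sketch, samples whose skill is already easy to identify receive a larger exploration weight, while poorly discriminated samples are rewarded mostly for diversity. Other weighting rules would fit the same interface, and the mechanism in the paper may differ.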