Learning diverse, high-quality behaviors without supervision, for later use and adaptation, is a key ability of intelligent creatures. An ideal unsupervised skill discovery method produces diverse, high-quality skills in the absence of extrinsic reward, and the discovered skill set adapts efficiently to downstream tasks in various ways. In theory, maximizing the Mutual Information (MI) between skills and visited states achieves ideal skill-conditioned behavior distillation. In practice, however, recent advanced methods struggle to balance behavioral quality (exploration) and diversity (exploitation), which may be attributed to the unreasonable MI estimation produced by their rigid intrinsic reward design. In this paper, we propose Contrastive multi-objectives Skill Discovery (ComSD), which aims to mitigate the quality-versus-diversity conflict of discovered behaviors through a more reasonable MI estimation and a dynamically weighted intrinsic reward. ComSD employs contrastive learning for a more reasonable estimation of the skill-conditioned entropy in the MI decomposition. In addition, a novel weighting mechanism dynamically balances the different entropy estimations (from the MI decomposition) into a multi-objective intrinsic reward that improves both skill diversity and quality. For challenging robot behavior discovery, ComSD produces a high-quality skill set consisting of diverse behaviors at different activity levels, which recent advanced methods cannot. In numerical evaluations, ComSD exhibits state-of-the-art adaptation performance, significantly outperforming recent advanced skill discovery methods across all skill combination tasks and most skill finetuning tasks. Code will be released at https://github.com/liuxin0824/ComSD.
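For reference, the MI decomposition mentioned in the abstract can be written out with the standard entropy identity. This is a sketch under assumed notation: S for the visited states and Z for the skill variable are labels chosen here for illustration, and the paper's own symbols and estimators may differ.

```latex
% Assumed notation: S = visited states, Z = skill variable.
% Standard identity for mutual information:
\begin{align}
  I(S; Z) &= H(S) - H(S \mid Z) \\
          &= H(Z) - H(Z \mid S)
\end{align}
% In the first form, H(S) is the state entropy and H(S | Z) is the
% skill-conditioned entropy, i.e. the term the abstract says is
% estimated with contrastive learning.
```

In the first form, maximizing MI means raising the state entropy H(S) while lowering the skill-conditioned entropy H(S | Z). An intrinsic reward built from estimates of these two terms therefore carries two objectives, which is presumably where the dynamic weighting described in the abstract applies.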