Being able to solve a task in diverse ways makes agents more robust to task variations and less prone to local optima. In this context, constrained diversity optimization has become a useful reinforcement learning (RL) framework for training a set of diverse agents in parallel. However, existing constrained-diversity RL methods often under-explore in complex tasks such as robot manipulation, resulting in limited behavioral diversity. We address this with a two-stage curriculum that introduces a spline-based trajectory prior as an inductive bias to produce diverse, high-reward behaviors in an initial stage, and then distills these behaviors into reactive, step-wise policies in a second stage. In our empirical evaluation, we provide novel insights into challenges of diversity-targeted training and show that our curriculum increases the diversity of learned skills while maintaining high task performance.
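To make the two-stage idea concrete, the following is a minimal sketch rather than the method described here. It assumes a Catmull-Rom spline (the abstract says only "spline-based"), a handful of control points as the stage-one parameterization, and delta-position actions as the stage-two distillation targets. The function names `spline_trajectory` and `to_stepwise_targets` are hypothetical.

```python
import numpy as np


def spline_trajectory(control_points, num_steps):
    """Evaluate a Catmull-Rom spline through the control points.

    Assumption: a stage-one agent would output `control_points`
    (shape (K, D), K >= 2) instead of per-step actions, so that
    every rollout is a smooth trajectory by construction.
    """
    pts = np.asarray(control_points, dtype=float)
    k = len(pts)
    # Duplicate the endpoints so the curve starts and ends on them.
    padded = np.vstack([pts[:1], pts, pts[-1:]])
    # Global parameter u in [0, K-1]; the integer part selects the segment.
    u = np.linspace(0.0, k - 1, num_steps)
    seg = np.minimum(u.astype(int), k - 2)
    t = (u - seg)[:, None]
    p0, p1, p2, p3 = (padded[seg + i] for i in range(4))
    return 0.5 * (
        2.0 * p1
        + (p2 - p0) * t
        + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t**2
        + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t**3
    )


def to_stepwise_targets(trajectory):
    """Turn a trajectory into (state, action) pairs for distillation.

    Assumption: the state is the current waypoint and the action is
    the displacement to the next waypoint. A reactive, step-wise
    policy could then be fit to these pairs by supervised regression.
    """
    trajectory = np.asarray(trajectory, dtype=float)
    states = trajectory[:-1]
    actions = np.diff(trajectory, axis=0)
    return states, actions


# Example: three 2-D control points expanded into a 21-step trajectory.
traj = spline_trajectory([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]], num_steps=21)
states, actions = to_stepwise_targets(traj)
```

In this sketch, diversity would be sought over a few control points rather than over long action sequences, which is one plausible reading of why a trajectory prior can ease exploration. The distillation step then recovers a policy that acts one step at a time.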