AutoQD: Automatic Discovery of Diverse Behaviors with Quality-Diversity Optimization

Quality-Diversity (QD) algorithms have shown remarkable success in discovering diverse, high-performing solutions, but they rely heavily on hand-crafted behavioral descriptors that constrain exploration to predefined notions of diversity. Leveraging the equivalence between policies and occupancy measures, we present a theoretically grounded approach that automatically generates behavioral descriptors by embedding the occupancy measures of policies in Markov Decision Processes. Our method, AutoQD, uses random Fourier features to approximate the Maximum Mean Discrepancy (MMD) between policy occupancy measures, creating embeddings whose distances reflect meaningful behavioral differences. A low-dimensional projection of these embeddings that captures the most behaviorally significant dimensions can then be used as behavioral descriptors for CMA-MAE, a state-of-the-art black-box QD method, to discover diverse policies. We prove that distances between our embeddings converge to the true MMD distances between occupancy measures as the number of sampled trajectories and the embedding dimension increase. Through experiments on multiple continuous control tasks, we demonstrate AutoQD's ability to discover diverse policies without predefined behavioral descriptors, presenting a well-motivated alternative to prior methods in unsupervised Reinforcement Learning and QD optimization. Our approach opens new possibilities for open-ended learning and automated behavior discovery in sequential decision-making settings without requiring domain-specific knowledge. Source code is available at https://github.com/conflictednerd/autoqd-code.
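
The embedding idea described above can be illustrated with a minimal sketch. It is not the released AutoQD implementation. It assumes a Gaussian kernel, discount weights normalized to sum to one over each finite trajectory, and plain PCA as a stand-in for the low-dimensional projection. All names (`make_rff`, `embed_policy`, `project`) and the toy "policies" (Gaussian clouds of state-action vectors) are hypothetical and chosen only for illustration.

```python
import numpy as np


def make_rff(input_dim, num_features, bandwidth, rng):
    """Sample a random Fourier feature map for a Gaussian kernel (assumed kernel choice)."""
    W = rng.normal(scale=1.0 / bandwidth, size=(num_features, input_dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)

    def phi(x):
        # x: (n, input_dim) -> features: (n, num_features)
        return np.sqrt(2.0 / num_features) * np.cos(x @ W.T + b)

    return phi


def embed_policy(trajectories, phi, gamma=0.99):
    """Average discount-weighted features over sampled trajectories.

    Each trajectory is an array of shape (T, state_dim + action_dim).
    The result is a Monte Carlo estimate of the mean feature vector
    under the policy's (finite-horizon, normalized) occupancy measure.
    """
    embedding = 0.0
    for traj in trajectories:
        weights = gamma ** np.arange(len(traj))
        weights = weights / weights.sum()  # assumption: normalize per trajectory
        embedding = embedding + weights @ phi(traj)
    return embedding / len(trajectories)


def project(embeddings, k):
    """Project embeddings onto their top-k principal directions (plain PCA stand-in)."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


rng = np.random.default_rng(0)
dim, num_features = 3, 512
phi = make_rff(dim, num_features, bandwidth=1.0, rng=rng)


def toy_rollouts(mean, n_traj=20, horizon=50):
    # Hypothetical stand-in for environment rollouts: i.i.d. Gaussian state-action vectors.
    return [rng.normal(loc=mean, size=(horizon, dim)) for _ in range(n_traj)]


emb_a = embed_policy(toy_rollouts(0.0), phi)   # "policy" A
emb_a2 = embed_policy(toy_rollouts(0.0), phi)  # same behavior, fresh samples
emb_b = embed_policy(toy_rollouts(2.0), phi)   # different behavior

# Euclidean distance between embeddings approximates the MMD between occupancy measures.
d_same = np.linalg.norm(emb_a - emb_a2)
d_diff = np.linalg.norm(emb_a - emb_b)

# Low-dimensional descriptors that a QD optimizer could consume.
descriptors = project(np.stack([emb_a, emb_a2, emb_b]), k=2)
print(d_same, d_diff, descriptors.shape)
```

In this toy setting, the two sample sets drawn from the same behavior should land much closer together than the sets drawn from different behaviors, which is the property that makes the projected embeddings usable as behavioral descriptors. In a real pipeline the rollouts would come from the environment, and the descriptors would be passed to a QD optimizer such as CMA-MAE.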
