Parallelization in Reinforcement Learning is typically employed to speed up the training of a single policy, with multiple workers collecting experience from the same sampling distribution. This common design limits the potential of parallelization by neglecting the advantages of diverse exploration strategies. We propose K-Myriad, a scalable and unsupervised method that maximizes the collective state entropy induced by a population of parallel policies. By cultivating a portfolio of specialized exploration strategies, K-Myriad provides a robust initialization for Reinforcement Learning, leading to both higher training efficiency and the discovery of heterogeneous solutions. Experiments on high-dimensional continuous control tasks with large-scale parallelization demonstrate that K-Myriad learns a broad set of distinct policies, highlighting its effectiveness for collective exploration and paving the way towards novel parallelization strategies.
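The abstract does not specify how the collective state entropy is estimated. As a minimal sketch only, one common nonparametric choice is a k-nearest-neighbor entropy estimate computed over the pooled states visited by all policies in the population, so that each policy is implicitly rewarded for covering regions its peers do not. Everything below (the function names, the pooling scheme, the toy data) is an illustrative assumption, not K-Myriad's actual algorithm.

```python
import numpy as np

def knn_state_entropy(states: np.ndarray, k: int = 3) -> float:
    """Kozachenko-Leonenko-style entropy estimate of a state sample
    (up to additive constants): a larger average distance to the k-th
    nearest neighbor means the states are spread more widely."""
    n, d = states.shape
    # Pairwise Euclidean distances; O(n^2) memory, fine for a sketch.
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude self-distances
    kth = np.sort(dists, axis=1)[:, k - 1]   # distance to k-th neighbor
    # H ~ (d / n) * sum_i log r_k(i), dropping constant terms.
    return float(d * np.mean(np.log(kth + 1e-12)))

def collective_entropy(per_policy_states: list, k: int = 3) -> float:
    """Hypothetical collective objective: score the entropy of the
    union of states collected by all K parallel policies."""
    pooled = np.concatenate(per_policy_states, axis=0)
    return knn_state_entropy(pooled, k=k)

# Toy usage: K = 4 "policies", each sampling states in R^2 around a
# different mode. Pooling the modes yields a high collective entropy.
rng = np.random.default_rng(0)
states_by_policy = [rng.normal(loc=i, scale=0.1, size=(64, 2)) for i in range(4)]
print(collective_entropy(states_by_policy))
```

Under this (assumed) pooled objective, a policy only increases the score by visiting states far from those of the rest of the population, which is one plausible way a portfolio of specialized exploration strategies could emerge.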