Learning control policies with large discrete action spaces is a challenging problem in the field of reinforcement learning due to present inefficiencies in exploration. With high dimensional action spaces, there are a large number of potential actions in each individual dimension over which policies would be learned. In this work, we introduce a Deep Reinforcement Learning (DRL) algorithm call Multi-Action Networks (MAN) Learning that addresses the challenge of high-dimensional large discrete action spaces. We propose factorizing the N-dimension action space into N 1-dimensional components, known as sub-actions, creating a Value Neural Network for each sub-action. Then, MAN uses temporal-difference learning to train the networks synchronously, which is simpler than training a single network with a large action output directly. To evaluate the proposed method, we test MAN on three scenarios: an n-dimension maze task, a block stacking task, and then extend MAN to handle 12 games from the Atari Arcade Learning environment with 18 action spaces. Our results indicate that MAN learns faster than both Deep Q-Learning and Double Deep Q-Learning, implying our method is a better performing synchronous temporal difference algorithm than those currently available for large discrete action spaces.
翻译:在强化学习领域中,由于探索效率低下,学习具有大规模离散动作空间的控制策略是一项具有挑战性的问题。在高维动作空间中,每个单独维度上存在大量潜在动作,而策略需要针对这些动作进行学习。本文提出了一种名为多行动网络(MAN)学习的深度强化学习(DRL)算法,用于解决高维大规模离散动作空间的挑战。我们建议将N维动作空间分解为N个一维分量(称为子动作),并为每个子动作创建一个价值神经网络。随后,MAN利用时序差分学习同步训练这些网络,这比直接训练具有大规模动作输出的单一网络更为简单。为了评估所提出的方法,我们在三种场景下测试了MAN:一个n维迷宫任务、一个积木堆叠任务,并将MAN扩展到处理来自Atari Arcade学习环境中具有18个动作空间的12款游戏。结果表明,MAN的学习速度优于深度Q学习和双重深度Q学习,这表明我们的方法是一种比当前用于大规模离散动作空间的同步时序差分算法性能更优的方法。