The Arcade Learning Environment (ALE) was proposed as an evaluation platform for empirically assessing the generality of agents across dozens of Atari 2600 games. ALE offers a variety of challenging problems and has drawn significant attention from the deep reinforcement learning (RL) community. From Deep Q-Networks (DQN) to Agent57, RL agents appear to achieve superhuman performance in ALE. But is this really the case? To explore this question, we first review the evaluation metrics currently used in Atari benchmarks and show that the current criteria for superhuman performance are inappropriate, because they underestimate human performance relative to what is possible. To address these problems and promote RL research, we propose a novel Atari benchmark based on human world records (HWR), which places higher demands on RL agents in both final performance and learning efficiency. Furthermore, we summarize the state-of-the-art (SOTA) methods on Atari benchmarks and provide benchmark results under the new HWR-based evaluation metrics. From these results, we conclude that at least four open challenges hinder RL agents from achieving superhuman performance. Finally, we discuss some promising ways to address these challenges.
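To illustrate why the choice of human reference matters, the sketch below compares two normalizations of the same agent score. The first is the human normalized score commonly reported for Atari agents, which rescales a raw score so that random play maps to 0 and a human baseline maps to 1. The second swaps the human baseline for a human world record; this substitution is an assumption made here by analogy and is not necessarily the exact metric defined in the paper. The function name `normalized_score` and all numeric values are hypothetical and do not correspond to any real game or agent.

```python
def normalized_score(agent: float, random: float, reference: float) -> float:
    """Rescale a raw score so that random play is 0.0 and the reference is 1.0."""
    if reference == random:
        raise ValueError("reference and random scores must differ")
    return (agent - random) / (reference - random)


# Hypothetical raw scores for a single made-up game (not real benchmark data).
RANDOM_SCORE = 100.0
HUMAN_BASELINE_SCORE = 1_100.0    # assumed score of a typical human tester
HUMAN_WORLD_RECORD = 10_100.0     # assumed human world record
AGENT_SCORE = 2_100.0

# Normalized against the human baseline, the agent looks superhuman (> 1.0).
hns = normalized_score(AGENT_SCORE, RANDOM_SCORE, HUMAN_BASELINE_SCORE)

# Normalized against the world record, the same agent is far from superhuman.
hwr_ns = normalized_score(AGENT_SCORE, RANDOM_SCORE, HUMAN_WORLD_RECORD)

print(f"vs. human baseline: {hns:.2f}")
print(f"vs. world record:   {hwr_ns:.2f}")
```

With these made-up numbers, the same raw score sits above 1.0 under the first normalization and well below 1.0 under the second, which is the kind of gap the abstract attributes to underestimating human performance.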