The reproducibility of many experimental results in Deep Reinforcement Learning (RL) is in question. To address this reproducibility crisis, we propose a theoretically sound methodology for comparing multiple Deep RL algorithms. The performance of a single execution of a Deep RL algorithm is random, so independent executions are needed to assess it precisely. When comparing several RL algorithms, a major question is how many executions must be made and how we can ensure that the results of such a comparison are theoretically sound. Researchers in Deep RL often use fewer than 5 independent executions to compare algorithms: we claim that this is not enough in general. Moreover, when several algorithms are compared at once, the errors of the individual comparisons accumulate and must be accounted for with a multiple testing procedure to preserve low error guarantees. To address this problem in a statistically sound way, we introduce AdaStop, a new statistical test based on multiple group sequential tests. When comparing algorithms, AdaStop adapts the number of executions so as to stop as early as possible while ensuring that we have enough information to distinguish, in a statistically significant way, the algorithms that perform better than the others. We show both theoretically and empirically that AdaStop has a low probability of making an error (Family-Wise Error). Finally, we illustrate the effectiveness of AdaStop in multiple use cases, including toy examples and difficult cases such as MuJoCo environments.
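To make the point about accumulating errors concrete, the following minimal sketch computes the family-wise error for several comparisons run at the same level. It is not AdaStop's procedure: it assumes the comparisons are independent, which is a simplification, and it uses the textbook Bonferroni correction purely for illustration. The function names `family_wise_error` and `bonferroni_level` are hypothetical.

```python
def family_wise_error(alpha: float, num_comparisons: int) -> float:
    """Probability of at least one false positive among `num_comparisons`
    tests, each run at level `alpha`.

    Assumption (for illustration only): the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** num_comparisons


def bonferroni_level(alpha: float, num_comparisons: int) -> float:
    """Per-comparison level under the standard Bonferroni correction,
    which keeps the family-wise error at or below `alpha`."""
    return alpha / num_comparisons


if __name__ == "__main__":
    alpha = 0.05
    for m in (1, 3, 6, 10):
        uncorrected = family_wise_error(alpha, m)
        corrected = family_wise_error(bonferroni_level(alpha, m), m)
        print(f"m={m:2d}  uncorrected={uncorrected:.3f}  corrected={corrected:.3f}")
```

Under these assumptions, ten uncorrected comparisons at level 0.05 give roughly a 40% chance of at least one false positive, which is why a multiple testing procedure is needed when several algorithms are compared at once.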