Recently, the scientific community has questioned the statistical reproducibility of many empirical results, especially in the field of machine learning. To contribute to the resolution of this reproducibility crisis, we propose a theoretically sound methodology for comparing the performance of a set of algorithms. We exemplify our methodology in Deep Reinforcement Learning (Deep RL). The performance of one execution of a Deep RL algorithm is a random variable, so several independent executions are needed to evaluate it. When comparing algorithms with random performance, a major question is how many executions must be performed to ensure that the result of the comparison is theoretically sound. Researchers in Deep RL often use fewer than 5 independent executions to compare algorithms: we claim that this is not enough in general. Moreover, when comparing more than 2 algorithms at once, a multiple-testing procedure must be used to preserve low error guarantees. We introduce AdaStop, a new statistical test based on multiple group sequential tests. When used to compare algorithms, AdaStop adapts the number of executions to stop as early as possible while ensuring that enough information has been collected to distinguish algorithms that have different score distributions. We prove theoretically that AdaStop has a low probability of making a (family-wise) error. We illustrate the effectiveness of AdaStop in various use cases, including toy examples and Deep RL algorithms on challenging MuJoCo environments. AdaStop is the first statistical test suited to this sort of comparison: it is both a significant contribution to statistics and an important contribution to computational studies performed in reinforcement learning and other domains.
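To make the core idea concrete, the following is a minimal sketch (not the authors' AdaStop implementation) of a group sequential two-sample comparison: scores arrive in batches, a permutation test is run on all data collected so far at each interim look, and the procedure stops early once the difference is clear. The per-interim threshold here is a simple Bonferroni split, a cruder correction than the one AdaStop derives; the two score distributions at the end are hypothetical stand-ins for two algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_pvalue(x, y, n_perm=2000):
    """Two-sided permutation p-value for a difference in mean scores."""
    obs = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Count permutations at least as extreme as the observed difference.
        if abs(pooled[:len(x)].mean() - pooled[len(x):].mean()) >= obs:
            count += 1
    return (count + 1) / (n_perm + 1)

def sequential_compare(sample_a, sample_b, batch=5, max_interims=4, alpha=0.05):
    """Run `batch` more executions per algorithm at each interim look;
    stop as soon as the accumulated evidence separates the two."""
    xs, ys = [], []
    for k in range(1, max_interims + 1):
        xs.extend(sample_a(batch))
        ys.extend(sample_b(batch))
        p = perm_pvalue(np.array(xs), np.array(ys))
        if p < alpha / max_interims:   # Bonferroni-split interim threshold
            return "different", k * batch
    return "undecided", max_interims * batch

# Two hypothetical algorithms with different score distributions.
decision, n_runs_each = sequential_compare(
    lambda n: rng.normal(0.0, 1.0, n).tolist(),
    lambda n: rng.normal(3.0, 1.0, n).tolist(),
)
```

With well-separated score distributions, the procedure typically stops after the first one or two batches instead of always using the full budget, which is exactly the adaptivity the abstract describes.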