Machine learning research is making progress on its reproducibility crisis, but reinforcement learning (RL) faces challenges of its own. Comparisons of point estimates, and plots that show successful convergence to the optimal policy during training, may obscure overfitting or dependence on the experimental setup. Although RL researchers have proposed reliability metrics that account for uncertainty to better understand each algorithm's strengths and weaknesses, the recommendations of past work do not assume the presence of out-of-distribution observations. We propose a set of evaluation methods that measure the robustness of RL algorithms under distribution shifts. These tools make the case for tracking performance over time while the agent is acting in its environment. In particular, we recommend time series analysis as a method of observational RL evaluation. We also show that the unique properties of RL and simulated dynamic environments allow us to make stronger assumptions that justify measuring causal impact in our evaluations. We then apply these tools to single-agent and multi-agent environments to show the impact of introducing distribution shifts at test time. We present this methodology as a first step toward rigorous RL evaluation in the presence of distribution shifts.
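As a rough illustration of the kind of time series evaluation described here, the sketch below compares the rewards observed after a distribution shift with a counterfactual extrapolated from the pre-shift trend. It is not the method of this work: the function names, the choice of a linear pre-shift trend, and the synthetic reward sequence are all assumptions made for the example. Reading the result causally also assumes that the simulator lets the evaluator decide exactly when the shift is introduced.

```python
from statistics import mean


def fit_linear_trend(values):
    """Ordinary least squares fit of values[t] ~ intercept + slope * t."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pre-shift points to fit a trend")
    t_mean = (n - 1) / 2
    y_mean = mean(values)
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


def estimate_shift_effect(rewards, shift_step):
    """Estimate the effect of a shift introduced at index `shift_step`.

    rewards: per-step (or per-episode) rewards from one test-time run.
    shift_step: hypothetical index at which the evaluator introduces the shift.
    The counterfactual is the pre-shift linear trend extended forward
    (an assumption of this sketch, not a claim about the source's model).
    """
    if not 2 <= shift_step < len(rewards):
        raise ValueError("shift_step must leave >= 2 pre-shift and >= 1 post-shift points")
    pre, post = rewards[:shift_step], rewards[shift_step:]
    intercept, slope = fit_linear_trend(pre)
    counterfactual = [intercept + slope * t for t in range(shift_step, len(rewards))]
    pointwise = [obs - cf for obs, cf in zip(post, counterfactual)]
    return {
        "pointwise_effect": pointwise,
        "mean_effect": mean(pointwise),
        "cumulative_effect": sum(pointwise),
    }


# Synthetic example: reward is flat at 1.0, then drops to 0.6 once the shift starts.
example_rewards = [1.0] * 5 + [0.6] * 5
example_result = estimate_shift_effect(example_rewards, shift_step=5)
```

A negative `mean_effect` indicates that the agent performed worse after the shift than the pre-shift trend would predict. In practice one would repeat this over many seeds and report uncertainty rather than a single point estimate.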