Bridging RL Theory and Practice with the Effective Horizon

Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to explain why, i.e., provide bounds that predict practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enable us to compute instance-dependent bounds exactly. We focus on deterministic environments because they share many interesting properties with stochastic environments but are easier to analyze. Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but we discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e., when it is optimal to be greedy on the random policy's Q-function), deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP to identify the next optimal action when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that effective horizon-based bounds reflect the empirical performance of PPO and DQN more closely than prior sample complexity bounds across four metrics. We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon
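To make the central property concrete, the following minimal sketch checks it on two toy tabular MDPs. It is an illustration under stated assumptions, not the code from the repository above: the function names, the two three-state MDPs, and the choice to check every state at every timestep (rather than only reachable ones) are all hypothetical. Q-values are computed by finite-horizon backward induction, once with a mean backup (uniformly random policy) and once with a max backup (optimal policy).

```python
def mean(values):
    """Average of a non-empty list; the backup for a uniformly random policy."""
    return sum(values) / len(values)


def q_values(next_state, reward, horizon, backup):
    """Finite-horizon Q-values of a deterministic tabular MDP.

    next_state[s][a] and reward[s][a] describe the deterministic dynamics.
    `backup` turns a state's list of Q-values into that state's value:
    `max` yields optimal Q-values, `mean` yields random-policy Q-values.
    Returns q[t][s][a], where t = 0 is the first timestep.
    """
    n_states, n_actions = len(next_state), len(next_state[0])
    value = [0.0] * n_states  # no reward is collected after the final step
    q_per_step = []
    for _ in range(horizon):  # iterate from the last timestep backwards
        q = [
            [reward[s][a] + value[next_state[s][a]] for a in range(n_actions)]
            for s in range(n_states)
        ]
        value = [backup(q[s]) for s in range(n_states)]
        q_per_step.append(q)
    q_per_step.reverse()
    return q_per_step


def greedy_on_random_is_optimal(next_state, reward, horizon, tol=1e-9):
    """True if every action maximizing the random policy's Q is also optimal.

    Assumption of this sketch: the check covers all states at all timesteps.
    """
    q_rand = q_values(next_state, reward, horizon, mean)
    q_opt = q_values(next_state, reward, horizon, max)
    for t in range(horizon):
        for s in range(len(next_state)):
            best_rand, best_opt = max(q_rand[t][s]), max(q_opt[t][s])
            for a in range(len(next_state[s])):
                greedy_pick = q_rand[t][s][a] >= best_rand - tol
                suboptimal = q_opt[t][s][a] < best_opt - tol
                if greedy_pick and suboptimal:
                    return False
    return True


# Hypothetical toy MDPs: state 2 is absorbing with zero reward.
NEXT_STATE = [[2, 1], [2, 2], [2, 2]]
# Easy case: moving to state 1 looks good even under random play.
EASY_REWARD = [[0.3, 0.0], [0.0, 1.0], [0.0, 0.0]]
# Hard case: a penalty in state 1 drags the random policy's value down,
# so the small immediate reward looks better than the optimal action.
HARD_REWARD = [[0.3, 0.0], [-1.0, 1.0], [0.0, 0.0]]

easy_result = greedy_on_random_is_optimal(NEXT_STATE, EASY_REWARD, horizon=2)
hard_result = greedy_on_random_is_optimal(NEXT_STATE, HARD_REWARD, horizon=2)
```

In the easy MDP the property holds, while in the hard MDP it fails at the initial state: random play in state 1 averages the penalty and the reward to zero, which hides the fact that the optimal policy can collect a return of 1 there. The effective horizon generalizes this one-step check to deeper lookahead, which this sketch does not attempt.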
