Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to explain why, i.e., provide bounds that are predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy, deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search are needed in order to identify the next optimal action when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds reflect the empirical performance of PPO and DQN more closely than prior sample complexity bounds across four metrics. We also show that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy.
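The property described above, that actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy, can be checked directly on a small tabular MDP. The following is a minimal sketch and not the procedure used to build BRIDGE. It assumes a deterministic, finite-horizon MDP, a uniformly random policy, and a check over every state and timestep. The function names (`q_values`, `random_greedy_is_optimal`) and the toy MDPs are hypothetical and chosen only for illustration.

```python
import numpy as np

def q_values(next_state, reward, horizon, backup):
    """Finite-horizon Q-values for a deterministic tabular MDP (illustrative).

    next_state, reward: arrays of shape (num_states, num_actions).
    backup: reduces next-step Q-values over actions; np.mean gives the
    uniformly random policy, np.max gives the optimal policy.
    Returns an array of shape (horizon, num_states, num_actions).
    """
    num_states, num_actions = reward.shape
    q = np.zeros((horizon, num_states, num_actions))
    v_next = np.zeros(num_states)  # value after the final step is zero
    for t in reversed(range(horizon)):
        q[t] = reward + v_next[next_state]
        v_next = backup(q[t], axis=1)
    return q

def random_greedy_is_optimal(next_state, reward, horizon, tol=1e-9):
    """True if every action maximizing Q under the random policy also
    maximizes Q under the optimal policy, at every state and timestep."""
    q_rand = q_values(next_state, reward, horizon, np.mean)
    q_opt = q_values(next_state, reward, horizon, np.max)
    best_rand = q_rand >= q_rand.max(axis=2, keepdims=True) - tol
    best_opt = q_opt >= q_opt.max(axis=2, keepdims=True) - tol
    return bool(np.all(best_opt[best_rand]))

# Toy MDP with 4 states and 2 actions; state 3 is absorbing with zero reward.
next_state = np.array([[1, 2], [3, 3], [3, 3], [3, 3]])

# In state 1, a random policy often takes the -20 action, so state 1 looks
# bad under random rollouts even though it is the best state to reach.
reward_misleading = np.array([[0.0, 0.0], [10.0, -20.0], [1.0, 1.0], [0.0, 0.0]])

# Here state 1 is good under both random and optimal behavior.
reward_aligned = np.array([[0.0, 0.0], [10.0, 5.0], [1.0, 1.0], [0.0, 0.0]])

print(random_greedy_is_optimal(next_state, reward_misleading, horizon=2))
print(random_greedy_is_optimal(next_state, reward_aligned, horizon=2))
```

In the first toy MDP the check fails, because the random-policy Q-values at the start state favor the action leading to state 2 while the optimal Q-values favor state 1. In the second it holds. This is only a one-step illustration of the property; the effective horizon generalizes it to deeper lookahead.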