Identify, Estimate and Bound the Uncertainty of Reinforcement Learning for Autonomous Driving

Deep reinforcement learning (DRL) has emerged as a promising approach for developing more intelligent autonomous vehicles (AVs). A typical DRL application for AVs is to train a neural-network-based driving policy. However, the black-box nature of neural networks can lead to unpredictable decision failures, which makes such AVs unreliable. To address this, this work proposes a method to identify unreliable decisions of a DRL driving policy and guard against them. The basic idea is to estimate and constrain the policy's performance uncertainty, which quantifies the potential performance drop caused by insufficient training data or network fitting errors. With the uncertainty constrained, the performance of the DRL model stays above that of a baseline policy. The uncertainty caused by insufficient data is estimated with the bootstrap method, and the uncertainty caused by network fitting error is estimated with an ensemble network. Finally, a baseline policy is added as a performance lower bound to avoid potential decision failures. The overall framework is called uncertainty-bound reinforcement learning (UBRL). UBRL is evaluated on DRL policies trained with different amounts of data, using an unprotected left turn as the example driving case. The results show that UBRL can identify potentially unreliable decisions of the DRL policy. UBRL is guaranteed to outperform the baseline policy even when the DRL policy is not well trained and has high uncertainty, and its performance improves with more training data. Such a method is valuable for applying DRL to real-road driving and provides a metric for evaluating a DRL policy.
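
The fallback idea described in the abstract can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the use of bootstrap means as stand-ins for ensemble value networks, and the confidence factor k are all assumptions made for illustration. The sketch estimates a lower confidence bound on the DRL action's value and keeps the DRL action only when that bound is at least the baseline policy's value.

```python
import random
import statistics


def bootstrap_resample(data, rng):
    """Draw len(data) samples with replacement (one bootstrap replicate)."""
    return [rng.choice(data) for _ in data]


def ensemble_value_estimates(returns, n_members, rng):
    """Hypothetical stand-in for an ensemble of value networks.

    Each 'member' is fitted on its own bootstrap replicate of the observed
    returns; here fitting is simply taking the mean of the replicate.
    """
    return [
        statistics.fmean(bootstrap_resample(returns, rng))
        for _ in range(n_members)
    ]


def select_action(drl_action, baseline_action, drl_estimates,
                  baseline_value, k=2.0):
    """Keep the DRL action only if its lower confidence bound is at least
    the baseline value; otherwise fall back to the baseline action.

    k (assumed) controls how conservative the bound is.
    """
    mean = statistics.fmean(drl_estimates)
    spread = statistics.pstdev(drl_estimates)  # disagreement across members
    lower_bound = mean - k * spread
    if lower_bound >= baseline_value:
        return drl_action, lower_bound
    return baseline_action, lower_bound


if __name__ == "__main__":
    rng = random.Random(0)
    # Many consistent samples: ensemble members agree, so the bound is tight.
    many_returns = [1.0 + 0.05 * ((i % 5) - 2) for i in range(200)]
    estimates = ensemble_value_estimates(many_returns, n_members=10, rng=rng)
    print(select_action("turn_left", "wait", estimates, baseline_value=0.5))
    # Strongly disagreeing members: the bound drops and the baseline is used.
    print(select_action("turn_left", "wait", [2.0, -1.0, 1.5], baseline_value=0.8))
```

In this sketch the spread across ensemble members shrinks as more consistent data is available, so the DRL action is selected more often, which mirrors the abstract's statement that performance improves with more training data.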
