Incomplete knowledge of the environment forces an agent to make decisions under uncertainty. A major dilemma in Reinforcement Learning (RL) is the exploration-exploitation trade-off: an autonomous agent must balance exploiting its current knowledge of the environment to maximize the cumulative reward against exploring actions that improve that knowledge and may lead to higher rewards. A related issue is the full observability of the states, which cannot be assumed in all applications. This is the case, for instance, when 2D images are used as input to an RL approach that must find the best actions within a 3D simulation environment. In this work, we address these issues by deploying and testing several techniques for balancing exploration and exploitation on partially observable systems, applied to steering-wheel prediction in autonomous driving scenarios. More precisely, our aim is to investigate the effects of using both adaptive and deterministic exploration strategies coupled with a Deep Recurrent Q-Network. Additionally, we adapted a modified quadratic loss function, intended to improve the learning phase of the underlying Convolutional Recurrent Neural Network, and evaluated its impact. We show that adaptive methods better approximate the trade-off between exploration and exploitation and that, in general, Softmax and Max-Boltzmann strategies outperform epsilon-greedy techniques.
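For readers unfamiliar with the three families of exploration strategies named above, the following is a minimal sketch of their textbook action-selection rules over a vector of Q-values. It is not the implementation evaluated in this work: the function names, the fixed `epsilon` and `temperature` parameters, and the example Q-values are all illustrative assumptions, and the Max-Boltzmann rule shown is the common formulation (greedy by default, Boltzmann sampling with probability `epsilon`).

```python
import math
import random


def greedy(q_values):
    """Index of the action with the highest estimated Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)


def softmax_probs(q_values, temperature):
    """Boltzmann distribution over actions; higher temperature means more exploration."""
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q_values, temperature, rng):
    """Sample an action in proportion to its Boltzmann probability."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def max_boltzmann(q_values, epsilon, temperature, rng):
    """Assumed formulation: greedy by default, Boltzmann sampling with probability epsilon."""
    if rng.random() < epsilon:
        return softmax_action(q_values, temperature, rng)
    return greedy(q_values)


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.1, 0.5, 0.2]  # hypothetical Q-values for three steering actions
    print(epsilon_greedy(q, 0.1, rng))
    print(softmax_action(q, 0.5, rng))
    print(max_boltzmann(q, 0.1, 0.5, rng))
```

The difference between the rules lies in how exploratory actions are chosen: epsilon-greedy explores uniformly at random, whereas Softmax and Max-Boltzmann weight exploratory choices by the current Q-value estimates, so clearly poor actions are tried less often.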