Incomplete knowledge of the environment forces an agent to make decisions under uncertainty. A major dilemma in Reinforcement Learning (RL) is the exploration-exploitation trade-off: an autonomous agent must balance exploiting its current knowledge of the environment to maximize the cumulative reward against exploring actions that improve that knowledge and may lead to higher rewards. A related issue is full observability of the state, which cannot be assumed in all applications, for example when an RL approach receives only 2D images as input while searching for the optimal action in a 3D simulation environment. In this work, we address these issues by deploying and testing several techniques for balancing the exploration-exploitation trade-off in partially observable systems, applied to steering-wheel prediction in an autonomous driving scenario. More precisely, our aim is to investigate the effects of coupling both stochastic and deterministic multi-armed bandit strategies with a Deep Recurrent Q-Network. Additionally, we adapt and evaluate the impact of an innovative method for improving the learning phase of the underlying Convolutional Recurrent Neural Network. We aim to show that adaptive stochastic exploration methods better approximate the trade-off between exploration and exploitation: in general, Softmax and Max-Boltzmann strategies are able to outperform epsilon-greedy techniques.
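As an illustration of the three exploration strategies named above, the following is a minimal sketch in plain Python. It is not the implementation used in this work: the function names, the `epsilon` and `temperature` parameters, and the example Q-values are all assumptions introduced for illustration. In the setting described here, the Q-values would presumably be produced by the Deep Recurrent Q-Network rather than given as a fixed list.

```python
import math
import random


def greedy_action(q_values):
    """Index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def boltzmann_probs(q_values, temperature):
    """Softmax (Boltzmann) distribution over actions; assumes temperature > 0."""
    q_max = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_action(q_values, temperature, rng=random):
    """Sample an action in proportion to its Boltzmann probability."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def max_boltzmann(q_values, epsilon, temperature, rng=random):
    """Greedy with probability 1 - epsilon; otherwise sample from the Boltzmann distribution."""
    if rng.random() < epsilon:
        return softmax_action(q_values, temperature, rng)
    return greedy_action(q_values)


if __name__ == "__main__":
    q = [0.1, 0.9, 0.3]  # hypothetical Q-values for three steering actions
    rng = random.Random(0)
    print(epsilon_greedy(q, epsilon=0.1, rng=rng))
    print(softmax_action(q, temperature=0.5, rng=rng))
    print(max_boltzmann(q, epsilon=0.1, temperature=0.5, rng=rng))
```

The sketch makes the difference in exploration behaviour visible: epsilon-greedy explores uniformly and ignores the value estimates, whereas Softmax and Max-Boltzmann weight exploratory actions by their estimated values, so clearly poor actions are sampled less often.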