Applying reinforcement learning (RL) in practice requires accounting for sub-optimal outcomes, which depend on how familiar the agent is with its uncertain environment. Dynamically adjusting the level of epistemic risk over the course of learning can yield reliable optimal policies in safety-critical environments and address the sub-optimality of a static risk level. In this work, we introduce a novel framework, Distributional RL with Online Risk Adaption (DRL-ORA), which jointly quantifies aleatory and epistemic uncertainty and dynamically selects the epistemic risk level by solving a total variation minimization problem online. The risk level can be selected efficiently by grid search with a Follow-The-Leader type algorithm, and its offline oracle is related to the "satisficing measure" (from the decision analysis community) under a special modification of the loss function. We show multiple classes of tasks where DRL-ORA outperforms existing methods that rely on either a fixed risk level or a manually predetermined risk level adaptation. Given the simplicity of our modifications, we believe the framework can be easily incorporated into most RL algorithm variants.
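To make the grid-search idea concrete, the following is a minimal sketch of a Follow-The-Leader selection over a finite grid of candidate risk levels. It is not the DRL-ORA implementation: the function names `follow_the_leader`, `empirical_quantile`, and `variation_losses` are hypothetical, and the per-round loss (the absolute change of an empirical quantile between two successive sets of value estimates) is only an assumed stand-in for the total variation objective, whose exact form is not given in the abstract.

```python
from typing import List, Sequence, Tuple


def follow_the_leader(
    loss_rounds: Sequence[Sequence[float]], grid: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """At each round, play the grid point with the smallest cumulative past loss."""
    cumulative = [0.0] * len(grid)
    choices: List[float] = []
    for losses in loss_rounds:
        # The leader is chosen from past losses only; ties go to the lowest index.
        leader = min(range(len(grid)), key=lambda i: cumulative[i])
        choices.append(grid[leader])
        # Only after committing to a choice are this round's losses revealed.
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return choices, cumulative


def empirical_quantile(values: Sequence[float], alpha: float) -> float:
    """Simple order-statistic quantile (assumed convention, for illustration)."""
    ordered = sorted(values)
    index = min(int(alpha * len(ordered)), len(ordered) - 1)
    return ordered[index]


def variation_losses(
    prev_estimates: Sequence[float],
    curr_estimates: Sequence[float],
    grid: Sequence[float],
) -> List[float]:
    """Hypothetical stand-in loss: how much each risk-level quantile moved."""
    return [
        abs(empirical_quantile(curr_estimates, a) - empirical_quantile(prev_estimates, a))
        for a in grid
    ]


if __name__ == "__main__":
    grid = [0.1, 0.5, 0.9]  # candidate risk levels (assumed values)
    rounds = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
    choices, totals = follow_the_leader(rounds, grid)
    print(choices, totals)
```

Because the grid is finite, each round costs one pass over the candidates, which is why a Follow-The-Leader rule of this kind is cheap to run alongside training. In practice the loss rounds would come from the agent's own uncertainty estimates rather than the hard-coded numbers used here.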