Adopting reasonable strategies is challenging but crucial for an intelligent agent with limited resources working in hazardous, unstructured, and dynamic environments: good strategies improve the system's utility, decrease the overall cost, and increase the probability of mission success. This paper proposes a novel directed acyclic strategy graph decomposition approach based on Bayesian chaining that separates an intricate policy into several simple sub-policies and organizes their relationships as a Bayesian strategy network (BSN). We integrate this approach into a state-of-the-art deep reinforcement learning (DRL) method, soft actor-critic (SAC), and build the corresponding Bayesian soft actor-critic (BSAC) model by organizing several sub-policies as a joint policy. We compare our method against state-of-the-art DRL algorithms on the standard continuous control benchmarks in the OpenAI Gym environment. The results demonstrate that the BSAC method has promising potential to significantly improve training efficiency.
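As a rough illustration of the chain-rule decomposition described in the abstract, the sketch below factorizes a joint policy over a small directed acyclic graph, so that the joint log-probability is the sum of the sub-policy log-probabilities, each conditioned on the state and on the actions of its parent nodes. The node names (`hip`, `knee`, `ankle`), the linear-Gaussian sub-policies, and all function names are assumptions made for illustration only; this is not the BSAC implementation, which uses neural sub-policies trained with SAC.

```python
import numpy as np

# Hypothetical BSN: node -> list of parent nodes (must form a DAG).
# Nodes are listed in topological order so parents are sampled first.
BSN = {"hip": [], "knee": ["hip"], "ankle": ["hip", "knee"]}
STATE_DIM = 4  # assumed state size for this toy example


def build_policies(rng):
    """One linear-Gaussian sub-policy (1-D action) per BSN node."""
    return {
        node: {"w": rng.normal(scale=0.1, size=STATE_DIM + len(parents)),
               "log_std": -0.5}
        for node, parents in BSN.items()
    }


def gaussian_log_prob(x, mean, log_std):
    """Log-density of a 1-D Gaussian N(mean, exp(log_std)^2) at x."""
    z = (x - mean) / np.exp(log_std)
    return -0.5 * z ** 2 - log_std - 0.5 * np.log(2 * np.pi)


def sub_policy_mean(policy, state, parent_actions):
    """Mean of a sub-policy given the state and its parents' actions."""
    x = np.concatenate([state, parent_actions])
    return float(policy["w"] @ x)


def sample_joint(policies, state, rng):
    """Ancestral sampling: log pi(a|s) = sum_i log pi_i(a_i | s, a_parents(i))."""
    actions, log_prob = {}, 0.0
    for node, parents in BSN.items():
        p = policies[node]
        mean = sub_policy_mean(p, state, [actions[q] for q in parents])
        actions[node] = mean + np.exp(p["log_std"]) * rng.normal()
        log_prob += gaussian_log_prob(actions[node], mean, p["log_std"])
    return actions, log_prob


def joint_log_prob(policies, state, actions):
    """Evaluate the factorized joint log-probability of given actions."""
    total = 0.0
    for node, parents in BSN.items():
        p = policies[node]
        mean = sub_policy_mean(p, state, [actions[q] for q in parents])
        total += gaussian_log_prob(actions[node], mean, p["log_std"])
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    policies = build_policies(rng)
    state = rng.normal(size=STATE_DIM)
    actions, log_prob = sample_joint(policies, state, rng)
    print(actions, log_prob)
```

In a SAC-style objective, the summed log-probability returned by `sample_joint` would play the role of the joint policy's entropy term; how BSAC actually wires the sub-policies into the critic is not specified in the abstract and is left out here.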