Autonomous cyber and cyber-physical systems need to perform decision-making, learning, and control in unknown environments. Such decision-making can be sensitive to multiple factors, including modeling errors, changes in costs, and the impact of events in the tails of probability distributions. Although multi-agent reinforcement learning (MARL) provides a framework for learning behaviors through repeated interactions with the environment by minimizing an average cost, minimizing an average cost alone is not adequate to overcome these challenges. In this paper, we develop a distributed MARL approach to solve decision-making problems in unknown environments by learning risk-aware actions. We use the conditional value-at-risk (CVaR) to characterize the cost function that is being minimized, and define a Bellman operator to characterize the value function associated with a given state-action pair. We prove that this operator satisfies a contraction property, and that its repeated application converges to the optimal value function. We then propose a distributed MARL algorithm called the CVaR QD-Learning algorithm, and establish that the value functions of individual agents reach consensus. We identify several challenges that arise in the implementation of the CVaR QD-Learning algorithm, and present solutions to overcome them. We evaluate the CVaR QD-Learning algorithm through simulations, and demonstrate the effect of a risk parameter on value functions at consensus.
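To make the role of the risk parameter concrete, the following is a minimal sketch of an empirical CVaR estimate for sampled costs. It is an illustration only, not the estimator used in the paper: the function name `empirical_cvar`, the sample-based tail average, and the convention that `alpha` is the fraction of worst-case outcomes averaged over (so `alpha = 1` recovers the ordinary mean) are all assumptions made here.

```python
import numpy as np


def empirical_cvar(costs, alpha):
    """Average of the worst alpha-fraction of sampled costs.

    Assumed convention (not taken from the paper): alpha lies in (0, 1],
    alpha = 1 gives the risk-neutral mean, and smaller alpha places more
    weight on the high-cost tail.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    # Sort so that the largest (worst) costs come first.
    sorted_costs = np.sort(np.asarray(costs, dtype=float))[::-1]
    # Number of tail samples to average; always keep at least one.
    k = max(1, int(np.ceil(alpha * sorted_costs.size)))
    return float(sorted_costs[:k].mean())


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 10.0]
    for a in (1.0, 0.5, 0.25):
        print(a, empirical_cvar(samples, a))
```

Under this convention, for the costs 1, 2, 3, 10 the estimate is 4.0 at `alpha = 1.0` (the plain average), 6.5 at `alpha = 0.5`, and 10.0 at `alpha = 0.25`, which shows how a smaller risk parameter shifts the objective toward tail events that an average-cost criterion would dilute.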