Deep Actor-Critic algorithms, which combine the Actor-Critic framework with deep neural networks (DNNs), are among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, existing deep Actor-Critic algorithms are not yet mature enough to solve realistic problems with non-convex stochastic constraints and a high cost of interacting with the environment. In this paper, we propose a single-loop deep Actor-Critic (SLDAC) algorithmic framework for general constrained reinforcement learning (CRL) problems. In the actor step, the constrained stochastic successive convex approximation (CSSCA) method is applied to handle the non-convex stochastic objective and constraints. In the critic step, the critic DNNs are updated only once or a finite number of times per iteration, which reduces the algorithm to a single-loop framework; existing works instead require enough critic updates in each iteration for the inner loop to converge sufficiently. Moreover, the variance of the policy gradient estimate is reduced by reusing observations collected under old policies. Together, the single-loop design and the observation reuse reduce both the agent-environment interaction cost and the computational complexity. Although the single-loop design and the observation reuse make the policy gradient estimate biased, we prove that SLDAC, started from a feasible initial point, converges almost surely to a Karush-Kuhn-Tucker (KKT) point of the original problem. Simulations show that the SLDAC algorithm achieves superior performance at a much lower interaction cost.
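The single-loop structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: a one-state problem with a Gaussian policy of mean `theta`, an objective cost `(a - 2)^2`, a constraint cost `a - 1` whose expectation must stay at or below zero, scalar running value estimates standing in for the critic DNNs, a closed-form one-dimensional quadratic surrogate standing in for the CSSCA convex subproblem, and the step-size exponents `-0.6` and `-0.8`. The names `solve_surrogate` and `run_single_loop` are hypothetical.

```python
from collections import deque

import numpy as np


def solve_surrogate(g0, c1, g1, tau):
    """Minimise g0*d + tau*d^2 subject to c1 + g1*d + tau*d^2 <= 0 over a scalar d."""
    disc = g1 * g1 - 4.0 * tau * c1
    if disc < 0.0:
        # The surrogate constraint has no feasible point: minimise the
        # constraint surrogate instead (a feasibility-seeking step).
        return -g1 / (2.0 * tau)
    lo = (-g1 - np.sqrt(disc)) / (2.0 * tau)
    hi = (-g1 + np.sqrt(disc)) / (2.0 * tau)
    # In one dimension, the constrained minimiser is the unconstrained one
    # clipped to the feasible interval [lo, hi].
    return float(np.clip(-g0 / (2.0 * tau), lo, hi))


def run_single_loop(iters=2000, batch=8, reuse=4, sigma=0.5, tau=1.0,
                    beta=0.2, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.0                    # feasible start: E[a] - 1 = -1 <= 0
    v = np.zeros(2)                # "critic": value estimates (objective, constraint)
    g = np.zeros(2)                # recursively averaged policy-gradient estimates
    buffer = deque(maxlen=reuse)   # keeps the most recent batches, old policies included

    for t in range(1, iters + 1):
        rho, gamma = t ** -0.6, t ** -0.8      # assumed step-size schedules

        # Interact with the environment: only `batch` new samples per iteration.
        a = theta + sigma * rng.standard_normal(batch)
        costs = np.stack([(a - 2.0) ** 2, a - 1.0])   # shape (2, batch)
        score = (a - theta) / sigma ** 2              # d/dtheta of log pi(a)
        buffer.append((costs, score))

        # Critic step: a single update per iteration, with no inner loop.
        v += beta * (costs.mean(axis=1) - v)

        # Actor step: gradient estimate from new and reused observations.
        all_costs = np.concatenate([c for c, _ in buffer], axis=1)
        all_scores = np.concatenate([s for _, s in buffer])
        g_hat = ((all_costs - v[:, None]) * all_scores).mean(axis=1)
        g = (1.0 - rho) * g + rho * g_hat

        # Solve the convex surrogate, then move part of the way towards its solution.
        d = solve_surrogate(g[0], v[1], g[1], tau)
        theta += gamma * d

    return theta, v
```

In this toy problem the constrained optimum is `theta = 1`, where the constraint is active. The reused batches were drawn under earlier values of `theta`, so the gradient estimate is biased towards those older policies; the decaying step size `gamma` keeps successive policies close, which is the intuition behind tolerating that bias.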