The average reward criterion remains comparatively understudied, since most existing work in the reinforcement learning literature considers the discounted reward criterion. A few recent works present on-policy average reward actor-critic algorithms, but off-policy average reward actor-critic methods have received less attention. In this work, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm. We first provide an asymptotic convergence analysis using the ODE-based method. Subsequently, we provide a finite-time analysis of the resulting stochastic approximation scheme with a linear function approximator and obtain an $\epsilon$-optimal stationary policy with a sample complexity of $\Omega(\epsilon^{-2.5})$. We compare the average reward performance of our proposed ARO-DDPG algorithm against state-of-the-art on-policy average reward actor-critic algorithms on MuJoCo-based environments and observe better empirical performance.