We extend the provably convergent Full Gradient DQN algorithm of Avrachenkov et al. (2021), developed for discounted-reward Markov decision processes, to average-reward problems. In the neural function approximation setting, we experimentally compare the widely used RVI Q-Learning with the recently proposed Differential Q-Learning, using both Full Gradient DQN and DQN. We also extend the approach to learn Whittle indices for Markovian restless multi-armed bandits. Across the tasks considered, we observe a better convergence rate for the proposed Full Gradient variant.