Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning

In large-scale machine learning, recent works have studied the effects of compressing gradients in stochastic optimization in order to alleviate the communication bottleneck. These works have collectively revealed that stochastic gradient descent (SGD) is robust to structured perturbations such as quantization, sparsification, and delays. Perhaps surprisingly, despite the surge of interest in large-scale, multi-agent reinforcement learning, almost nothing is known about the analogous question: Are common reinforcement learning (RL) algorithms also robust to similar perturbations? In this paper, we investigate this question by studying a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction, where a general compression operator is used to model the perturbation. Our main technical contribution is to show that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic theoretical guarantees as their SGD counterparts. We then extend our results significantly to nonlinear stochastic approximation algorithms and multi-agent settings. In particular, we prove that for multi-agent TD learning, one can achieve linear convergence speedups in the number of agents while communicating just $\tilde{O}(1)$ bits per agent at each time step. Our work is the first to provide finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling. Our analysis hinges on studying the drift of a novel Lyapunov function that captures the dynamics of a memory variable introduced by error feedback.
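As a rough illustration of the mechanism the abstract describes, the following sketch runs TD(0) with linear function approximation on a single Markovian trajectory, compresses the update direction, and carries the compression error forward in a memory variable. It is a minimal sketch under stated assumptions, not the paper's algorithm. The names `top_k` and `ef_td0`, the toy Markov reward process, and the step size are hypothetical. Top-k sparsification stands in for the general compression operator. The error-feedback rule is the form commonly used in the optimization literature, and the exact update analyzed in the paper may differ.

```python
import numpy as np

def top_k(v, k):
    """Assumed compressor: keep the k largest-magnitude entries of v, zero the rest.
    Requires 1 <= k <= len(v)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]  # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out

def ef_td0(P, r, Phi, gamma=0.9, alpha=0.05, k=1, steps=5000, seed=0):
    """Sketch of TD(0) with a compressed update direction and error feedback.

    P: (n, n) transition matrix, r: (n,) rewards, Phi: (n, d) feature matrix.
    All hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    theta = np.zeros(d)  # value-function parameters
    e = np.zeros(d)      # memory variable holding accumulated compression error
    s = 0
    for _ in range(steps):
        s_next = rng.choice(n, p=P[s])          # Markovian sampling along one trajectory
        td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        g = td_error * Phi[s]                   # uncompressed TD(0) update direction
        h = top_k(e + g, k)                     # compress direction plus past error
        theta = theta + alpha * h               # apply only the compressed update
        e = e + g - h                           # store what the compressor dropped
        s = s_next
    return theta, e

if __name__ == "__main__":
    # Hypothetical 3-state Markov reward process with 2 features per state.
    P = np.array([[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.4, 0.0, 0.6]])
    r = np.array([1.0, 0.0, -1.0])
    Phi = np.array([[1.0, 0.0], [0.6, 0.6], [0.0, 1.0]])
    theta, e = ef_td0(P, r, Phi)
    print("theta:", theta, "memory:", e)
```

With `k` equal to the parameter dimension the compressor is the identity, the memory stays at zero, and the loop reduces to plain TD(0). That makes it a convenient baseline when experimenting with how aggressive compression changes the iterates.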
