Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL

Divergence of Q-value estimates is a prominent issue in offline RL, where the agent has no access to the real dynamics. The conventional view attributes this instability to querying out-of-distribution actions when bootstrapping value targets. Although the issue can be alleviated with policy constraints or conservative Q estimation, a theoretical understanding of the mechanism that causes the divergence has been missing. In this work, we aim to understand this mechanism thoroughly and to derive an improved solution. We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. We then propose a novel metric, the Self-Excite Eigenvalue Measure (SEEM), based on the Neural Tangent Kernel (NTK), to measure how the Q-network evolves during training; it provides an intriguing explanation of how divergence emerges. For the first time, our theory can reliably decide at an early stage whether training will diverge, and can even predict the order of growth of the estimated Q-value, the model's norm, and the crashing step when an SGD optimizer is used. The experiments demonstrate perfect alignment with this theoretical analysis. Building on these insights, we propose to resolve divergence from a novel perspective, namely improving the model's architecture so that it extrapolates better. Through extensive empirical studies, we identify LayerNorm as a good solution that effectively avoids divergence without introducing detrimental bias, leading to superior performance. Experimental results show that it still works in some of the most challenging settings, e.g. using only 1% of the transitions in the dataset, where all previous methods fail. Moreover, it can be easily plugged into modern offline RL methods and achieves SOTA results on many challenging tasks. We also give unique insights into its effectiveness.
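The self-excitation idea can be illustrated with a toy sketch. It is not the paper's SEEM metric or its experiments: it assumes a linear Q-function Q(s) = w · φ(s), a single hypothetical transition, and unit-norm feature scaling as a crude stand-in for LayerNorm. All feature values, the reward, the learning rate, and the function names are made up for illustration. For a linear model the tangent kernel is simply k(a, b) = φ(a) · φ(b), so the sign of γ·k(s', s) − k(s, s) indicates whether one TD step raises the bootstrapped target more than the prediction.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(phi):
    """Scale a feature vector to unit L2 norm (assumed stand-in for LayerNorm)."""
    norm = math.sqrt(dot(phi, phi))
    return [x / norm for x in phi]

def excitation(phi_s, phi_next, gamma):
    """gamma * k(s', s) - k(s, s) for the linear kernel k(a, b) = a . b.

    A positive value means a TD step on s raises the bootstrapped target
    by more than it raises the prediction, so the TD error feeds itself.
    """
    return gamma * dot(phi_next, phi_s) - dot(phi_s, phi_s)

def run_td(phi_s, phi_next, reward=1.0, gamma=0.9, lr=0.1, steps=200):
    """Semi-gradient TD(0) on one fixed transition; returns the final Q(s)."""
    w = [0.0] * len(phi_s)
    for _ in range(steps):
        td_error = reward + gamma * dot(w, phi_next) - dot(w, phi_s)
        # Gradient of Q(s) with respect to w is phi_s (target held fixed).
        w = [wi + lr * td_error * xi for wi, xi in zip(w, phi_s)]
    return dot(w, phi_s)

# Hypothetical features: the bootstrapped input has a larger norm than the
# in-dataset input, mimicking poor extrapolation to an unseen action.
phi_s, phi_next = [1.0, 0.0], [2.0, 1.0]
gamma = 0.9

raw_sign = excitation(phi_s, phi_next, gamma)
norm_sign = excitation(normalize(phi_s), normalize(phi_next), gamma)
q_raw = run_td(phi_s, phi_next, gamma=gamma)
q_norm = run_td(normalize(phi_s), normalize(phi_next), gamma=gamma)

print("raw features:        excitation", raw_sign, "final Q(s)", q_raw)
print("normalized features: excitation", norm_sign, "final Q(s)", q_norm)
```

With raw features the excitation term is positive and Q(s) grows geometrically. After normalization, the Cauchy-Schwarz inequality bounds k(s', s) by k(s, s) = 1, so with γ < 1 the term is negative and Q(s) settles to a finite value. This captures only the flavor of the argument for normalization; a real Q-network has a nonlinear, time-varying kernel.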
