Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL

The divergence of Q-value estimation has been a prominent issue in offline RL, where the agent has no access to the real dynamics. Conventional wisdom attributes this instability to querying out-of-distribution actions when bootstrapping value targets. Although the issue can be alleviated with policy constraints or conservative Q estimation, a theoretical understanding of the mechanism underlying the divergence has been absent. In this work, we aim to thoroughly understand this mechanism and arrive at an improved solution. We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. We then propose a novel metric, the Self-Excite Eigenvalue Measure (SEEM), based on the Neural Tangent Kernel (NTK), to measure how the Q-network evolves during training, which provides an intriguing explanation for the emergence of divergence. For the first time, our theory can reliably decide at an early stage whether training will diverge, and can even predict the order of growth of the estimated Q-value, the model's norm, and the crashing step when an SGD optimizer is used. The experiments demonstrate perfect alignment with this theoretical analysis. Building on these insights, we propose to resolve divergence from a novel perspective, namely improving the model's architecture for better extrapolation behavior. Through extensive empirical studies, we identify LayerNorm as a good solution that effectively avoids divergence without introducing detrimental bias, leading to superior performance. Experimental results show that it still works in some of the most challenging settings, i.e., using only 1% of the transitions in the dataset, where all previous methods fail. Moreover, it can be easily plugged into modern offline RL methods and achieves SOTA results on many challenging tasks. We also offer unique insights into its effectiveness.
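To make the architectural idea concrete, the following is a minimal NumPy sketch of a one-hidden-layer Q-network with and without LayerNorm. It is an illustration under stated assumptions, not the implementation used in this work: the function names (`init_params`, `q_value`, `ln_output_bound`), the layer sizes, and the choice to place LayerNorm on the hidden pre-activations before the ReLU are all hypothetical. The sketch shows one reason normalization can help with extrapolation: without LayerNorm the output of this bias-free ReLU network grows linearly as the input moves away from the data, whereas with LayerNorm the output magnitude is bounded by a quantity that depends only on the last-layer weights.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    # Normalize the last axis to zero mean and (almost) unit variance.
    # No learnable scale/shift here, to keep the sketch short (assumption).
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def init_params(rng, in_dim, hidden_dim):
    # Hypothetical initialization; biases start at zero.
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim,)),
        "b2": 0.0,
    }

def q_value(params, state_action, use_layer_norm):
    # state_action: concatenated (state, action) vector of shape (in_dim,).
    h = state_action @ params["W1"] + params["b1"]
    if use_layer_norm:
        h = layer_norm(h)          # assumed placement: before the ReLU
    h = np.maximum(h, 0.0)         # ReLU
    return float(h @ params["w2"] + params["b2"])

def ln_output_bound(params):
    # The normalized hidden vector has L2 norm at most sqrt(hidden_dim),
    # and ReLU cannot increase that norm, so by Cauchy-Schwarz
    # |Q| <= ||w2|| * sqrt(hidden_dim) + |b2| for every input.
    hidden_dim = params["w2"].shape[0]
    return float(np.linalg.norm(params["w2"]) * np.sqrt(hidden_dim)
                 + abs(params["b2"]))

rng = np.random.default_rng(0)
params = init_params(rng, in_dim=8, hidden_dim=64)
x = rng.normal(size=8)             # stand-in for an in-distribution input

for scale in (1.0, 10.0, 100.0, 1000.0):
    q_plain = q_value(params, scale * x, use_layer_norm=False)
    q_ln = q_value(params, scale * x, use_layer_norm=True)
    print(f"scale={scale:7.1f}  plain Q={q_plain:12.4f}  LayerNorm Q={q_ln:8.4f}")

print("LayerNorm output bound:", ln_output_bound(params))
```

The bound in `ln_output_bound` holds only for this simplified network; a trained Q-network with learnable LayerNorm gains and biases would need the corresponding terms included.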
