Linear recurrent neural networks have shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by constructing and studying two linear filters: (i) the first exactly reproduces the pre-softmax logits of the belief vector in a hidden Markov model (HMM) under a deterministic transition matrix, and therefore serves as a sufficient statistic for optimal policy learning; (ii) the second achieves vanishing state-decoding error under a nearly deterministic transition matrix, reducing state ambiguity to near zero. The results extend to action-controlled HMMs, where the corresponding linear filters become time-varying with action-dependent dynamics. We illustrate our main results through numerical experiments and further show that the constructed linear filter serves as a strong feature extractor in a small reinforcement learning game.
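To make claim (i) concrete, the following is a minimal numerical sketch, not the paper's construction. It assumes the deterministic transition matrix is a permutation matrix and that the emission probabilities and initial belief are strictly positive; under those assumptions the log of the Bayes belief update is a linear recurrence in the logits. All names (`belief_filter`, `linear_logit_filter`, `P`, `O`, `b0`) and the problem sizes are illustrative choices, not taken from the source.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def belief_filter(b0, P, O, obs):
    # Standard HMM forward filter: b_t is proportional to O[:, y_t] * (P^T b_{t-1}).
    b = b0.copy()
    out = []
    for y in obs:
        b = O[:, y] * (P.T @ b)
        b = b / b.sum()
        out.append(b)
    return np.array(out)

def linear_logit_filter(b0, P, O, obs):
    # Linear recurrence on logits: l_t = P^T l_{t-1} + log O[:, y_t].
    # Assumption: P is a permutation matrix, so P^T only reorders entries
    # and the elementwise log commutes with it.
    l = np.log(b0)
    log_O = np.log(O)
    out = []
    for y in obs:
        l = P.T @ l + log_O[:, y]
        out.append(l)
    return np.array(out)

# Hypothetical toy problem: 4 hidden states, 3 observation symbols.
rng = np.random.default_rng(0)
n, m, T = 4, 3, 20
P = np.eye(n)[rng.permutation(n)]        # deterministic (permutation) transitions
O = rng.uniform(0.1, 1.0, size=(n, m))   # strictly positive emission weights
O /= O.sum(axis=1, keepdims=True)        # each row is a distribution over symbols
b0 = np.full(n, 1.0 / n)                 # uniform initial belief
obs = rng.integers(0, m, size=T)         # arbitrary observation sequence

beliefs = belief_filter(b0, P, O, obs)
logits = linear_logit_filter(b0, P, O, obs)
recovered = np.apply_along_axis(softmax, 1, logits)

print("max abs difference:", np.max(np.abs(recovered - beliefs)))
```

The logit recurrence never normalizes: the per-step normalizing constant only shifts all logits by the same amount, and softmax is invariant to such shifts. For a deterministic but non-invertible transition matrix the prediction step sums probabilities, so this simple argument no longer applies as written.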