This paper addresses the algorithmic and theoretical core of reinforcement learning (RL) by introducing a novel class of proximal Bellman mappings. These mappings are defined in reproducing kernel Hilbert spaces (RKHSs) to benefit from the rich approximation properties and the inner product of RKHSs. They are shown to belong to the powerful Hilbertian family of (firmly) nonexpansive mappings, regardless of the values of their discount factors, and they possess ample degrees of design freedom, enough to reproduce attributes of the classical Bellman mappings and to pave the way for novel RL designs. An approximate policy-iteration scheme is built on the proposed class of mappings to solve the problem of selecting online, at every time instant, the "optimal" exponent $p$ in a $p$-norm loss to combat outliers in linear adaptive filtering, without training data or any knowledge of the statistical properties of the outliers. Numerical tests on synthetic data showcase the superior performance of the proposed framework over several non-RL and kernel-based RL schemes.
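To make the role of the exponent $p$ concrete, the following is a minimal sketch of one stochastic-gradient step of a linear adaptive filter under the loss $|y - w^\top x|^p$ with a fixed $p$. It is not the RL-based selection scheme proposed in the paper. The function name `lmp_update`, the step size `mu`, and the restriction to $p \in [1, 2]$ are assumptions made only for illustration.

```python
def lmp_update(w, x, y, p, mu):
    """One gradient step on the loss |y - w.x|**p (illustrative sketch).

    w  : list of current filter coefficients
    x  : list of input samples (same length as w)
    y  : observed scalar output
    p  : loss exponent, assumed here to lie in [1, 2]
    mu : step size (hypothetical tuning parameter)
    """
    # Estimation error of the linear filter.
    e = y - sum(wi * xi for wi, xi in zip(w, x))
    # Sign of the error: -1, 0, or 1.
    sign = (e > 0) - (e < 0)
    # The gradient of |e|**p with respect to w is -p * |e|**(p-1) * sign(e) * x,
    # so a descent step adds the following multiple of x.
    scale = mu * p * abs(e) ** (p - 1) * sign
    return [wi + scale * xi for wi, xi in zip(w, x)]


# An outlier-sized observation moves the filter far less for p = 1 than for p = 2.
w_p1 = lmp_update([0.0, 0.0], [1.0, 2.0], 100.0, 1.0, 0.1)
w_p2 = lmp_update([0.0, 0.0], [1.0, 2.0], 100.0, 2.0, 0.1)
```

With $p = 2$ the step is proportional to the error itself, so a single outlier can dominate the update. With $p = 1$ only the sign of the error enters, which bounds the size of the step. This trade-off is what makes the choice of $p$ at each time instant matter.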