Recently, there have been numerous attempts to improve the sample efficiency of off-policy reinforcement learning (RL) agents when interacting with the environment, including architectural improvements and new algorithms. Despite these advances, such methods overlook the potential of directly constraining the initial representations of the input data, which can intuitively alleviate distribution shift and stabilize training. In this paper, we introduce the Tanh function into the initial layer to impose such a constraint. We theoretically analyze the convergence of temporal difference learning with the Tanh function under linear function approximation. Motivated by these theoretical insights, we present the Constrained Initial Representations framework, dubbed CIR, which consists of three components: (i) a Tanh activation combined with normalization to stabilize representations; (ii) a skip-connection module that provides a linear pathway from the shallow layer to the deep layer; and (iii) convex Q-learning, which allows a more flexible value estimate and mitigates potential conservatism. Empirical results show that CIR performs strongly on numerous continuous control tasks, matching or surpassing strong existing baselines.
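To make the theoretical setting concrete, the following is a sketch of the standard TD(0) update under linear function approximation, with the assumption (our reading, not the abstract's exact formulation) that the Tanh constraint enters through the feature map:

$$
\theta_{t+1} = \theta_t + \alpha_t \big( r_t + \gamma\, \theta_t^{\top}\phi(s_{t+1}) - \theta_t^{\top}\phi(s_t) \big)\, \phi(s_t), \qquad \phi(s) = \tanh(W s),
$$

where $\|\phi(s)\|_\infty < 1$ for all $s$; the Tanh keeps the features uniformly bounded, which is the kind of property a convergence analysis would typically exploit.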
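As a concrete illustration of components (i) and (ii), below is a minimal PyTorch sketch; the class name `CIREncoder`, the hidden width, and the exact placement of the normalization and skip connection are assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn as nn


class CIREncoder(nn.Module):
    """A minimal sketch of a constrained initial representation:
    the first layer bounds the features in (-1, 1) via Tanh after
    normalization, and an additive skip connection forwards the
    shallow features to the deeper block. Layer ordering and widths
    are hypothetical."""

    def __init__(self, obs_dim: int, hidden_dim: int = 256):
        super().__init__()
        # (i) constrained initial layer: Linear -> LayerNorm -> Tanh
        self.initial = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.Tanh(),
        )
        # deeper nonlinear block
        self.deep = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.initial(obs)  # bounded representation in (-1, 1)
        # (ii) skip connection: linear pathway from shallow to deep layer
        return self.deep(h) + h
```

The Tanh after normalization keeps the initial representation bounded, while the additive skip gives gradients a linear path around the deeper nonlinear block.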