In this paper, we tackle the challenging problem of delayed rewards in reinforcement learning (RL). While Proximal Policy Optimization (PPO) has emerged as a leading policy gradient method, its performance can degrade under delayed rewards. We introduce two key enhancements to PPO: a hybrid policy architecture that combines an offline policy (trained on expert demonstrations) with an online PPO policy, and a reward shaping mechanism using Time Window Temporal Logic (TWTL). The hybrid architecture leverages offline data throughout training while maintaining PPO's theoretical guarantees. Building on the monotonic improvement framework of Trust Region Policy Optimization (TRPO), we prove that our approach ensures improvement over both the offline policy and previous iterations, with a bounded performance gap of $(2\varsigma\gamma\alpha^2)/(1-\gamma)^2$, where $\alpha$ is the mixing parameter, $\gamma$ is the discount factor, and $\varsigma$ bounds the expected advantage. Additionally, we prove that our TWTL-based reward shaping preserves the optimal policy of the original problem. TWTL enables formal translation of temporal objectives into immediate feedback signals that guide learning. We demonstrate the effectiveness of our approach through extensive experiments on inverted pendulum and lunar lander environments, showing improvements in both learning speed and final performance compared to standard PPO and offline-only approaches.
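To make the role of the mixing parameter $\alpha$ concrete, the following is a minimal sketch, not the paper's implementation, of how an $\alpha$-mixed policy over a discrete action space could combine a fixed offline policy with an online PPO policy; the class name `HybridPolicy` and the callables `offline_probs` and `ppo_probs` are hypothetical placeholders assumed here for illustration.

```python
# Minimal sketch (assumptions, not the paper's implementation): an alpha-mixed
# policy that blends a fixed offline policy with an online PPO policy.
import numpy as np


class HybridPolicy:
    def __init__(self, offline_probs, ppo_probs, alpha):
        # offline_probs, ppo_probs: callables mapping a state to a vector of
        # action probabilities over the same discrete action space
        # alpha: mixing parameter in [0, 1]; alpha = 1 recovers the offline policy
        self.offline_probs = offline_probs
        self.ppo_probs = ppo_probs
        self.alpha = alpha

    def action_probs(self, state):
        # pi_mix(a|s) = alpha * pi_off(a|s) + (1 - alpha) * pi_PPO(a|s)
        return self.alpha * self.offline_probs(state) + (1.0 - self.alpha) * self.ppo_probs(state)

    def sample(self, state, rng=None):
        # Draw an action from the mixed distribution.
        rng = np.random.default_rng() if rng is None else rng
        p = self.action_probs(state)
        return rng.choice(len(p), p=p)
```

In this reading, annealing $\alpha$ toward zero would shift weight from the offline policy to the learned PPO policy as training progresses, which is consistent with the quadratic dependence of the stated performance-gap bound on $\alpha$.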