This work compares ways of extending Reinforcement Learning algorithms to Partially Observed Markov Decision Processes (POMDPs) with options. One view of options is as temporally extended actions, which can also serve as a memory that allows the agent to retain historical information beyond the policy's context window. While option assignment could be handled using heuristics and hand-crafted objectives, learning temporally consistent options and their associated sub-policies without explicit supervision remains a challenge. Two algorithms, PPOEM and SOAP, are proposed and studied in depth to address this problem. PPOEM applies the forward-backward algorithm for Hidden Markov Models to optimize the expected return of an option-augmented policy. However, this learning approach is unstable during on-policy rollouts. It is also ill-suited for learning causal policies, which cannot depend on knowledge of future trajectories, since option assignments are optimized for offline sequences in which the entire episode is available. As an alternative, SOAP evaluates the policy gradient under the optimal option assignment. It extends generalized advantage estimation (GAE) to propagate option advantages through time, which is analytically equivalent to performing temporal back-propagation of option policy gradients. The resulting option policy is conditioned only on the agent's history, not on future actions. Evaluated against competing baselines, SOAP exhibited the most robust performance: it correctly discovered options in POMDP corridor environments and, on standard benchmarks including Atari and MuJoCo, outperformed PPOEM as well as LSTM and Option-Critic baselines. The open-source code is available at https://github.com/shuishida/SoapRL.
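To make the PPOEM description concrete, the sketch below shows the standard forward-backward computation the abstract alludes to, treating options as the hidden states of an HMM whose emission likelihoods are the sub-policy action probabilities. The function name, the fixed option-transition matrix, and the normalization choices are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def option_posteriors(emissions, transition, prior):
    """Forward-backward over latent options for one trajectory (illustrative sketch).

    emissions:  (T, K) array, emissions[t, z] = pi(a_t | s_t, option=z)
    transition: (K, K) array, transition[z, z'] = p(z_t = z' | z_{t-1} = z)
    prior:      (K,)   initial option distribution
    Returns a (T, K) array of smoothed posteriors p(z_t | full trajectory).
    """
    T, K = emissions.shape
    alpha = np.zeros((T, K))   # forward messages (rescaled each step)
    beta = np.ones((T, K))     # backward messages

    # Forward pass: filter the option belief given actions so far.
    alpha[0] = prior * emissions[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ transition) * emissions[t]
        alpha[t] /= alpha[t].sum()

    # Backward pass: fold in the likelihood of the remaining trajectory.
    for t in range(T - 2, -1, -1):
        beta[t] = transition @ (emissions[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()   # per-step rescaling for numerical stability

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

The backward pass is exactly why, as noted above, such posteriors are only available offline: each smoothed assignment depends on the entire episode, so a causal policy cannot use them at decision time.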
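For the SOAP side, the sketch below shows one plausible reading of "propagating option advantages through time": a per-option GAE recursion in which future advantages are mixed through an option-transition model before being discounted. The array shapes, the `transition` matrix, and the bootstrapping scheme are assumptions for illustration and should not be read as SOAP's actual recursion.

```python
import numpy as np

def option_gae(rewards, values, dones, transition, gamma=0.99, lam=0.95):
    """GAE-style advantage propagation per option (illustrative sketch).

    rewards:    (T,)     rewards r_t
    values:     (T+1, K) option-conditioned value estimates V(s_t, z)
    dones:      (T,)     episode-termination flags
    transition: (K, K)   option-transition probabilities p(z' | z)
    Returns a (T, K) array of advantages, one column per option.
    """
    T, K = values.shape[0] - 1, values.shape[1]
    adv = np.zeros((T, K))
    next_adv = np.zeros(K)
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]
        # TD error per option, bootstrapping from the option-mixed next value.
        next_v = transition @ values[t + 1]   # E_{z'}[V(s_{t+1}, z') | z_t = z]
        delta = rewards[t] + gamma * next_v * mask - values[t]
        # Propagate future option advantages back through the transition model.
        adv[t] = delta + gamma * lam * mask * (transition @ next_adv)
        next_adv = adv[t]
    return adv
```

Note that with an identity transition matrix this reduces to K independent standard GAE recursions, one per option; the transition mixing is what couples advantages across options over time, mirroring the temporal back-propagation of option policy gradients described above.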