Deep Reinforcement Learning (DRL) has made considerable advances in simulated and physical robot control tasks, especially when the problem admits a fully observed Markov Decision Process (MDP) formulation. When observations only partially capture the underlying state, the problem becomes a Partially Observable MDP (POMDP), and the relative ranking of algorithms can change. We empirically compare Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC) on representative POMDP variants of continuous-control benchmarks. Contrary to widely reported MDP results, where TD3 and SAC typically outperform PPO, we observe an inversion: PPO is more robust than TD3 and SAC under partial observability. We attribute this to the stabilizing effect of multi-step bootstrapping, which PPO's return estimates typically use, whereas TD3 and SAC in their standard form bootstrap from one-step targets. Consistent with this explanation, incorporating multi-step targets into TD3 (MTD3) and SAC (MSAC) improves their robustness. These findings provide practical guidance for selecting and adapting DRL algorithms in partially observable settings without requiring new theoretical machinery.
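To make the notion of a multi-step target concrete, the following is a minimal sketch of an n-step bootstrapped target of the kind that could replace the one-step target in TD3 or SAC. The exact form used by MTD3 and MSAC is not given here, so the function name `n_step_target`, its parameters, and the handling of terminal flags are illustrative assumptions. `bootstrap_value` stands for whatever value estimate the algorithm would bootstrap from at the end of the segment (for example, the minimum of the two target critics in TD3, or that minimum minus the entropy term in SAC). The sketch applies no off-policy correction.

```python
def n_step_target(rewards, bootstrap_value, gamma, dones=None):
    """Illustrative n-step target (hypothetical helper, not the paper's code).

    Computes r_t + gamma * r_{t+1} + ... + gamma^(n-1) * r_{t+n-1}
             + gamma^n * bootstrap_value,
    where n = len(rewards). If a transition is marked done, nothing
    after it (including the bootstrap value) contributes to the target.
    """
    if dones is None:
        dones = [False] * len(rewards)
    target = bootstrap_value
    # Work backwards so each step folds in the discounted future.
    for reward, done in zip(reversed(rewards), reversed(dones)):
        if done:
            target = reward  # terminal transition: no value beyond it
        else:
            target = reward + gamma * target
    return target


# With a single reward this reduces to the usual one-step target.
one_step = n_step_target([1.0], bootstrap_value=10.0, gamma=0.5)
three_step = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

With more steps, the target depends more on observed rewards and less on the critic's estimate, since the bootstrap term is discounted by `gamma ** n`.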