Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn to make decisions from data collected under a stochastic behavior policy. We present the \textit{Latent Macro Action Planner} (L-MAP), which addresses this challenge by learning a set of temporally extended macro-actions through a state-conditional Vector Quantized Variational Autoencoder (VQ-VAE), effectively reducing action dimensionality. L-MAP employs a separately learned prior model that acts as a latent transition model and enables efficient sampling of plausible actions. During planning, our approach accounts for stochasticity in both the environment and the behavior policy through Monte Carlo tree search (MCTS). In offline RL settings, including stochastic continuous control tasks, L-MAP efficiently searches over discrete latent actions to yield high expected returns. Empirical results show that L-MAP maintains low decision latency despite increased action dimensionality. Notably, across tasks ranging from continuous control with inherently stochastic dynamics to high-dimensional robotic hand manipulation, L-MAP significantly outperforms existing model-based methods and performs on par with strong model-free actor-critic baselines, highlighting the effectiveness of the proposed approach for planning in complex, stochastic environments with high-dimensional action spaces.