Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. In this work, we introduce normalizing flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. This design enables tractable log-likelihood computation, efficient sampling, and the modeling of rich multimodal behaviors. We derive new theoretical guarantees, including explicit KL-divergence bounds for real-valued non-volume-preserving (RealNVP) policies and PAC-style sample-efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, we evaluate NF-HIQL across diverse long-horizon locomotion, ball-dribbling, and multi-step manipulation tasks from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.
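To make the core design concrete, below is a minimal sketch of a RealNVP-style flow policy head in PyTorch, assuming a conditioning vector built from state and goal features. All names here (`AffineCoupling`, `FlowPolicy`, `cond_dim`, layer sizes) are illustrative assumptions, not the paper's actual implementation; the sketch only shows why coupling layers give the exact log-likelihoods and cheap sampling the abstract refers to.

```python
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """One RealNVP affine coupling layer: transform half of the action
    dimensions conditioned on the other half plus (state, goal) features.
    The Jacobian is triangular, so log|det J| is just the sum of the
    predicted log-scales, which keeps the log-likelihood tractable."""

    def __init__(self, dim, cond_dim, hidden=64, flip=False):
        super().__init__()
        assert dim % 2 == 0, "sketch assumes an even action dimension"
        self.half = dim // 2
        self.flip = flip  # alternate which half is transformed per layer
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * self.half),
        )

    def _split(self, x):
        a, b = x[:, :self.half], x[:, self.half:]
        return (b, a) if self.flip else (a, b)

    def _merge(self, a, b):
        return torch.cat((b, a), -1) if self.flip else torch.cat((a, b), -1)

    def forward(self, z, cond):  # base noise -> action (sampling direction)
        z1, z2 = self._split(z)
        log_s, t = self.net(torch.cat([z1, cond], -1)).chunk(2, -1)
        log_s = torch.tanh(log_s)  # bounded scales for numerical stability
        return self._merge(z1, z2 * log_s.exp() + t), log_s.sum(-1)

    def inverse(self, x, cond):  # action -> base noise (likelihood direction)
        x1, x2 = self._split(x)
        log_s, t = self.net(torch.cat([x1, cond], -1)).chunk(2, -1)
        log_s = torch.tanh(log_s)
        return self._merge(x1, (x2 - t) * torch.exp(-log_s)), -log_s.sum(-1)


class FlowPolicy(nn.Module):
    """Stack of coupling layers with alternating masks. Sampling pushes
    standard-normal noise forward; log_prob inverts the stack and applies
    the change-of-variables formula:
        log pi(a | s, g) = log p_base(z) + sum_k log|det J_k^{-1}|."""

    def __init__(self, act_dim, cond_dim, n_layers=4):
        super().__init__()
        self.act_dim = act_dim
        self.layers = nn.ModuleList(
            AffineCoupling(act_dim, cond_dim, flip=(i % 2 == 1))
            for i in range(n_layers)
        )
        self.base = torch.distributions.Normal(0.0, 1.0)

    def sample(self, cond):
        x = self.base.sample((cond.shape[0], self.act_dim))
        for layer in self.layers:
            x, _ = layer(x, cond)
        return x

    def log_prob(self, action, cond):
        x, total = action, 0.0
        for layer in reversed(self.layers):
            x, log_det = layer.inverse(x, cond)
            total = total + log_det
        return self.base.log_prob(x).sum(-1) + total
```

In an NF-HIQL-style hierarchy, one would presumably instantiate such a head twice, a high-level flow over subgoals and a low-level flow over primitive actions; the actual training objective and architecture details are beyond this sketch.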