Incorporating demonstration data into reinforcement learning (RL) can greatly accelerate learning, but existing approaches often assume demonstrations are optimal and fully aligned with the target task. In practice, demonstrations are frequently sparse, suboptimal, or misaligned, and integrating them into RL can then degrade performance. We propose Adaptive Policy Composition (APC), a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors. Instead of enforcing strict adherence to the priors, APC estimates each prior's applicability to the target task while leveraging them for exploration. Moreover, APC refines useful priors or sidesteps misaligned ones as needed to optimize downstream reward. Across diverse benchmarks, APC accelerates learning when demonstrations are aligned, remains robust under severe misalignment, and exploits suboptimal demonstrations to bootstrap exploration without the performance degradation caused by adhering to them too strictly.
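To make the composition idea concrete, here is a minimal PyTorch sketch, not the paper's implementation: `FlowPrior` is a hypothetical stand-in (a state-conditioned diagonal Gaussian in place of a real normalizing flow), and `AdaptiveComposition` mixes several frozen priors through learned applicability logits that would be tuned against downstream reward. All names and the shaping-term suggestion are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowPrior(nn.Module):
    """Hypothetical stand-in prior: a state-conditioned diagonal Gaussian
    used here in place of a real normalizing flow."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * action_dim),
        )

    def log_prob(self, state, action):
        mu, log_sigma = self.net(state).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_sigma.exp()).log_prob(action).sum(-1)

    def sample(self, state):
        mu, log_sigma = self.net(state).chunk(2, dim=-1)
        return mu + log_sigma.exp() * torch.randn_like(mu)

class AdaptiveComposition(nn.Module):
    """Mixture over frozen priors; the applicability logits are the only
    trainable part and would be updated from downstream reward."""
    def __init__(self, priors):
        super().__init__()
        self.priors = nn.ModuleList(priors)
        self.priors.requires_grad_(False)        # keep the pretrained priors fixed
        self.logits = nn.Parameter(torch.zeros(len(priors)))

    def log_prob(self, state, action):
        log_w = F.log_softmax(self.logits, dim=0)                 # (K,) applicability weights
        lps = torch.stack(
            [p.log_prob(state, action) for p in self.priors], dim=-1)  # (B, K)
        return torch.logsumexp(lps + log_w, dim=-1)               # mixture log-density

    def sample(self, state):
        # Pick a prior in proportion to its estimated applicability, then sample from it.
        k = torch.distributions.Categorical(logits=self.logits).sample().item()
        return self.priors[k].sample(state)

# Exploratory actions come from the composed prior; an RL objective could add
# a shaping term like -beta * composed.log_prob(s, a) to stay near useful priors.
state_dim, action_dim = 8, 2
composed = AdaptiveComposition([FlowPrior(state_dim, action_dim) for _ in range(3)])
s = torch.randn(4, state_dim)
a = composed.sample(s)
print(composed.log_prob(s, a).shape)   # torch.Size([4])
```

Because only the logits receive gradients in this sketch, a misaligned prior can be downweighted toward zero rather than followed, which is one way to realize the "sidestep" behavior the abstract describes; refining a useful prior would additionally unfreeze its parameters.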