Developing policies that can adjust to non-stationary environments is essential for real-world reinforcement learning applications. However, learning such adaptable policies in offline settings, with only a limited set of pre-collected trajectories, presents significant challenges. A key difficulty arises because the limited offline data makes it hard for the context encoder to differentiate between changes in the environment dynamics and shifts in the behavior policy, often leading to context misassociations. To address this issue, we introduce a novel approach called Debiased Offline Representation for fast online Adaptation (DORA). DORA incorporates an information bottleneck principle that maximizes the mutual information between the dynamics encoding and the environment data, while minimizing the mutual information between the dynamics encoding and the actions of the behavior policy. We present a practical implementation of DORA, leveraging tractable bounds of the information bottleneck principle. Our experimental evaluation across six benchmark MuJoCo tasks with variable parameters demonstrates that DORA not only learns a more precise dynamics encoding but also significantly outperforms existing baselines.
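At a high level, the debiasing objective described above can be sketched as an information-bottleneck trade-off between the two mutual information terms; the notation here (dynamics encoding $z_\phi$, environment transition data $\tau$, behavior-policy actions $a$, and trade-off weight $\beta$) is illustrative rather than taken verbatim from the paper's formulation:

$$\max_{\phi} \; I\big(z_\phi;\, \tau\big) \;-\; \beta\, I\big(z_\phi;\, a\big),$$

where $z_\phi$ is the encoding produced by the context encoder with parameters $\phi$, the first term encourages the encoding to capture the environment dynamics reflected in the transition data, and the second term penalizes information about the behavior policy's actions; as stated above, the practical implementation optimizes tractable bounds of these mutual information terms rather than the terms themselves.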