Offline meta reinforcement learning (OMRL) has emerged as a promising approach that avoids costly online interaction and achieves strong generalization by leveraging pre-collected data and meta-learning techniques. Previous context-based approaches predominantly rely on the intuition that alternating optimization between the context encoder and the policy yields performance improvements, as long as the context encoder maximizes the mutual information between the task and the task representation ($I(Z;M)$) while the policy adopts a standard offline reinforcement learning (RL) algorithm conditioned on the learned task representation. Despite promising results, the theoretical justification for why this intuition leads to performance improvements remains underexplored. Inspired by the return discrepancy scheme from model-based RL, we find that the previous optimization framework can be linked to the general RL objective of maximizing the expected return, thereby providing a plausible explanation for the performance improvements. Furthermore, scrutinizing this optimization framework, we find that it ignores the effect of the task representation changing during the alternating optimization process, which may cause the performance improvement to collapse. We name this issue \underline{task representation shift} and theoretically prove that monotonic performance improvements can be guaranteed with appropriate context encoder updates. We devise different ways of reining in the task representation shift for three widely adopted training objectives that maximize $I(Z;M)$, across data of different qualities. Empirical results show that reining in the task representation shift can indeed improve performance. Our work opens up a new avenue for OMRL and leads to a better understanding of the relationship between performance and the task representation.
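For intuition only, the sketch below illustrates the alternating-optimization idea with the context encoder update reined in: the encoder takes a gradient step on a contrastive surrogate for maximizing $I(Z;M)$, and the step is reverted if the resulting task representations move further than a fixed budget. This is a minimal illustration, not the authors' implementation; `ContextEncoder`, `infonce_loss`, `encoder_step`, and the threshold `epsilon` are hypothetical names, InfoNCE stands in for whichever $I(Z;M)$-maximizing objective is used, and the offline RL policy update is left as a comment.

```python
# Minimal, illustrative sketch (not the paper's code) of alternating optimization
# with the context-encoder update reined in to limit task representation shift.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEncoder(nn.Module):
    """Maps a batch of transitions from one task to a task representation z."""
    def __init__(self, transition_dim, z_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, context):                # context: (n_transitions, transition_dim)
        return self.net(context).mean(dim=0)   # permutation-invariant aggregation into z

def infonce_loss(encoder, contexts):
    """Contrastive surrogate for maximizing I(Z; M): two disjoint context halves
    from the same task should yield matching z's; different tasks should not."""
    halves = [c.shape[0] // 2 for c in contexts]
    z_a = torch.stack([encoder(c[:h]) for c, h in zip(contexts, halves)])
    z_b = torch.stack([encoder(c[h:]) for c, h in zip(contexts, halves)])
    logits = z_a @ z_b.t()                     # (n_tasks, n_tasks) similarity matrix
    labels = torch.arange(len(contexts))       # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

def encoder_step(encoder, contexts, enc_opt, epsilon=0.05):
    """One encoder update, reined in: if the task representations move further
    than the budget `epsilon` (hypothetical choice), revert the step."""
    old_encoder = copy.deepcopy(encoder)

    enc_opt.zero_grad()
    infonce_loss(encoder, contexts).backward()
    enc_opt.step()

    with torch.no_grad():
        # Task representation shift: average movement of z on the same contexts.
        shift = torch.stack([(encoder(c) - old_encoder(c)).norm()
                             for c in contexts]).mean()
        if shift > epsilon:
            encoder.load_state_dict(old_encoder.state_dict())
    return shift.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    transition_dim, n_tasks = 6, 4
    contexts = [torch.randn(32, transition_dim) for _ in range(n_tasks)]  # toy offline contexts
    encoder = ContextEncoder(transition_dim)
    enc_opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    for step in range(5):
        shift = encoder_step(encoder, contexts, enc_opt)
        # A policy update (any offline RL algorithm conditioned on z) would go here.
        print(f"step {step}: task representation shift = {shift:.4f}")
```

Rejecting an oversized step is only one crude way to bound the shift; a soft penalty or a smaller encoder learning rate would serve the same illustrative purpose under these assumptions.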