Despite the broad application of deep reinforcement learning (RL), transferring and adapting a policy to unseen but similar environments remains a significant challenge. Recently, language-conditioned policies have been proposed to facilitate policy transfer by learning a joint representation of observation and text that captures compact and invariant information across environments. Existing language-conditioned RL methods often learn the joint representation as a simple latent layer for the given instances (episode-specific observation and text), which inevitably includes noisy or irrelevant information and causes spurious, instance-dependent correlations, thus hurting generalization performance and training efficiency. To address this issue, we propose a conceptual reinforcement learning (CRL) framework that learns a concept-like joint representation for language-conditioned policies. The key insight is that, in human cognition, concepts are compact and invariant representations formed by extracting similarities from numerous real-world instances. In CRL, we propose a multi-level attention encoder and two mutual information constraints for learning compact and invariant concepts. Evaluated in two challenging environments, RTFM and Messenger, CRL significantly improves training efficiency (by up to 70%) and generalization to new environment dynamics (by up to 30%).