In-context reinforcement learning is an emerging field with great potential for advancing artificial intelligence. Its core capability is generalizing to unseen tasks through interaction with the environment. To acquire this capability, an agent must be trained on specifically curated data that contains a policy improvement, which the algorithm learns to extract and then apply in context in the environment. For many tasks, however, training RL agents may be infeasible, while obtaining human demonstrations can be relatively easy. Moreover, the optimal policy is rarely available; typically, only suboptimal demonstrations are. We propose $AD^{\epsilon}$, a method that leverages demonstrations containing no policy improvement and enables multi-task in-context learning in the presence of a suboptimal demonstrator. It does so by artificially creating a history of incremental improvement, in which noise is systematically introduced into the demonstrator's policy, so that each successive trajectory is marginally better than the previous one. We tested our approach on the Dark Room and Dark Key-to-Door environments, where it achieved more than a $2\times$ improvement over the best available policy in the data.
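The idea of building an artificial improvement history can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the noise is modeled as an $\epsilon$-greedy perturbation that decays linearly from 1 to 0 across episodes, the environment is a toy one-dimensional room with a hidden goal, and the names `demonstrator`, `noisy_action`, `rollout`, and `build_history` are hypothetical.

```python
import random


def demonstrator(pos, goal):
    """Hypothetical demonstrator: 0 = left, 1 = stay, 2 = right."""
    if pos < goal:
        return 2
    if pos > goal:
        return 0
    return 1


def noisy_action(demo_action, n_actions, eps, rng):
    """With probability eps take a uniform random action, else follow the demonstrator."""
    if rng.random() < eps:
        return rng.randrange(n_actions)
    return demo_action


def rollout(eps, goal, size, horizon, rng):
    """Run one episode on a 1-D line; reward is 1 whenever the agent lands on the goal."""
    pos, total, transitions = 0, 0, []
    for _ in range(horizon):
        action = noisy_action(demonstrator(pos, goal), 3, eps, rng)
        next_pos = min(size - 1, max(0, pos + action - 1))
        reward = 1 if next_pos == goal else 0
        transitions.append((pos, action, reward))
        pos, total = next_pos, total + reward
    return transitions, total


def build_history(n_episodes, goal=7, size=10, horizon=20, seed=0):
    """Concatenate episodes generated with decreasing noise (assumed linear schedule)."""
    if n_episodes < 2:
        raise ValueError("need at least two episodes to define a schedule")
    rng = random.Random(seed)
    history, returns, epsilons = [], [], []
    for i in range(n_episodes):
        eps = 1.0 - i / (n_episodes - 1)  # assumed schedule: 1.0 -> 0.0
        transitions, total = rollout(eps, goal, size, horizon, rng)
        history.extend(transitions)
        returns.append(total)
        epsilons.append(eps)
    return history, returns, epsilons


if __name__ == "__main__":
    _, returns, epsilons = build_history(n_episodes=11)
    for eps, ret in zip(epsilons, returns):
        print(f"eps={eps:.1f} return={ret}")
```

Because the noise is random, individual episode returns need not increase monotonically; only the noise level is guaranteed to decrease, so the improvement shows up as a trend across the history. A sequence model trained on such histories would then, under this reading, be asked to continue the trend in context.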