In-context learning is a promising approach to online policy learning for offline reinforcement learning (RL) methods, since the policy can be adapted at inference time without gradient optimization. However, the approach is hindered by significant computational costs, which stem from gathering large sets of training trajectories and from training large Transformer models. We address this challenge by introducing the In-context Exploration-Exploitation (ICEE) algorithm, designed to optimize the efficiency of in-context policy learning. Unlike existing models, ICEE performs the exploration-exploitation trade-off at inference time within a Transformer model, without the need for explicit Bayesian inference. Consequently, ICEE can solve Bayesian optimization problems as efficiently as Gaussian process-based methods do, but in significantly less time. Through experiments in grid world environments, we demonstrate that ICEE can learn to solve new RL tasks using only tens of episodes, a substantial improvement over the hundreds of episodes needed by the previous in-context learning method.