To generalize across tasks, an agent should acquire knowledge from past tasks that facilitates adaptation and exploration in future tasks. We focus on the problem of in-context adaptation and exploration, where an agent relies only on context, i.e., the history of states, actions, and/or rewards, rather than on gradient-based updates. Posterior sampling (an extension of Thompson sampling) is a promising approach, but it requires Bayesian inference and dynamic programming, which often involve unknowns (e.g., a prior) and costly computations. To address these difficulties, we use a transformer to learn an inference process from training tasks and consider a hypothesis space of partial models, represented as small Markov decision processes that are cheap to solve with dynamic programming. In our version of the Symbolic Alchemy benchmark, our method's adaptation speed and exploration-exploitation balance approach those of an exact posterior sampling oracle. We also show that even though partial models exclude relevant information from the environment, they can nevertheless lead to good policies.
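To make the posterior sampling loop concrete, the following is a minimal sketch under stated assumptions, not the method described above. It replaces the learned transformer inference with an exact Bayes update over a small, finite set of candidate MDPs (closer in spirit to the oracle than to the learned model), and solves each sampled MDP with value iteration. The toy two-state MDPs and all names (`make_mdp`, `value_iteration`, `posterior_update`, `run_episode`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_mdp(good_action):
    """Hypothetical 2-state, 2-action MDP: `good_action` always leads to state 1."""
    P = np.zeros((2, 2, 2))              # P[s, a, s'] transition probabilities
    for s in range(2):
        P[s, good_action, 1] = 1.0
        P[s, 1 - good_action, 0] = 1.0
    R = np.zeros((2, 2))                 # R[s, a] rewards
    R[1, :] = 1.0                        # being in state 1 is rewarding
    return P, R

def value_iteration(P, R, gamma=0.9, iters=200):
    """Dynamic programming on a small tabular MDP; returns a greedy policy."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)          # shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def posterior_update(log_post, hypotheses, s, a, s_next):
    """Exact Bayes update over the finite hypothesis set (an assumption here)."""
    lik = np.array([P[s, a, s_next] for P, _ in hypotheses])
    log_post = log_post + np.log(lik + 1e-12)
    return log_post - np.logaddexp.reduce(log_post)

def run_episode(true_mdp, hypotheses, log_post, rng, horizon=10, s=0):
    """Sample one hypothesis from the posterior, act greedily under it, then update."""
    probs = np.exp(log_post)
    probs /= probs.sum()
    k = rng.choice(len(hypotheses), p=probs)
    policy = value_iteration(*hypotheses[k])
    true_P, true_R = true_mdp
    total = 0.0
    for _ in range(horizon):
        a = policy[s]
        s_next = rng.choice(true_P.shape[2], p=true_P[s, a])
        total += true_R[s, a]
        log_post = posterior_update(log_post, hypotheses, s, a, s_next)
        s = s_next
    return total, log_post

rng = np.random.default_rng(0)
hypotheses = [make_mdp(0), make_mdp(1)]
true_mdp = hypotheses[0]                           # the task the agent actually faces
log_post = np.log(np.full(len(hypotheses), 0.5))   # uniform prior over hypotheses
for _ in range(3):
    episode_return, log_post = run_episode(true_mdp, hypotheses, log_post, rng)
```

In this toy setting a single observed transition is enough to rule out the wrong hypothesis, so the posterior concentrates almost immediately. Exploration comes entirely from sampling a hypothesis per episode rather than from an explicit exploration bonus.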