Large transformer models pretrained on offline reinforcement learning datasets have demonstrated remarkable in-context reinforcement learning (ICRL) capabilities: they can make good decisions when prompted with interaction trajectories from unseen environments. However, when and how transformers can be trained to perform ICRL is not well understood theoretically. In particular, it is unclear which reinforcement learning algorithms transformers can perform in context, and how distribution mismatch in offline training data affects the learned algorithms. This paper provides a theoretical framework for analyzing supervised pretraining for ICRL. The framework covers two recently proposed training methods: algorithm distillation and decision-pretrained transformers. First, assuming model realizability, we prove that the supervised-pretrained transformer imitates the conditional expectation of the expert algorithm given the observed trajectory. Its generalization error scales with model capacity and a distribution divergence factor between the expert and offline algorithms. Second, we show that transformers with ReLU attention can efficiently approximate near-optimal online reinforcement learning algorithms such as LinUCB and Thompson sampling for stochastic linear bandits, and UCB-VI for tabular Markov decision processes. This provides the first quantitative analysis of the ICRL capabilities of transformers pretrained from offline trajectories.