Large transformer models pretrained on offline reinforcement learning datasets have demonstrated remarkable in-context reinforcement learning (ICRL) capabilities: they can make good decisions when prompted with interaction trajectories from unseen environments. However, when and how transformers can be trained to perform ICRL is not yet well understood theoretically. In particular, it is unclear which reinforcement learning algorithms transformers can perform in context, and how distribution mismatch in offline training data affects the learned algorithms. This paper provides a theoretical framework for analyzing supervised pretraining for ICRL, covering two recently proposed training methods: algorithm distillation and decision-pretrained transformers. First, assuming model realizability, we prove that the supervised-pretrained transformer imitates the conditional expectation of the expert algorithm given the observed trajectory. The generalization error scales with model capacity and a distribution divergence factor between the expert and offline algorithms. Second, we show that transformers with ReLU attention can efficiently approximate near-optimal online reinforcement learning algorithms such as LinUCB and Thompson sampling for stochastic linear bandits, and UCB-VI for tabular Markov decision processes. This provides the first quantitative analysis of the ICRL capabilities of transformers pretrained from offline trajectories.
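For readers unfamiliar with LinUCB, one of the online algorithms the abstract says transformers can approximate in context, the following is a minimal sketch of its textbook form for a stochastic linear bandit with a finite action set. It is not the paper's transformer construction. The function names, the fixed confidence multiplier `alpha`, the ridge parameter `lam`, and the Gaussian noise model are all assumptions made for illustration. Plain Python lists are used so the example has no dependencies.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, v) for row in M]


def sherman_morrison(A_inv, x):
    """Return (A + x x^T)^{-1} given A^{-1}, assuming A is symmetric."""
    Ax = mat_vec(A_inv, x)
    denom = 1.0 + dot(x, Ax)
    d = len(x)
    return [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(d)]
            for i in range(d)]


def linucb_choose(A_inv, b, actions, alpha):
    """Index of the action with the largest upper confidence bound."""
    theta_hat = mat_vec(A_inv, b)  # ridge-regression estimate

    def ucb(x):
        # Exploration bonus: alpha * sqrt(x^T A^{-1} x).
        width = math.sqrt(max(dot(x, mat_vec(A_inv, x)), 0.0))
        return dot(x, theta_hat) + alpha * width

    scores = [ucb(x) for x in actions]
    return max(range(len(actions)), key=scores.__getitem__)


def run_linucb(theta_star, actions, horizon,
               alpha=1.0, lam=1.0, noise=0.1, seed=0):
    """Simulate LinUCB and return its cumulative (pseudo-)regret.

    theta_star is the hypothetical true parameter; rewards are
    <x, theta_star> plus Gaussian noise (an illustrative assumption).
    """
    rng = random.Random(seed)
    d = len(theta_star)
    # A = lam * I initially, so A^{-1} = I / lam.
    A_inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)]
             for i in range(d)]
    b = [0.0] * d
    best = max(dot(x, theta_star) for x in actions)
    regret = 0.0
    for _ in range(horizon):
        x = actions[linucb_choose(A_inv, b, actions, alpha)]
        mean = dot(x, theta_star)
        reward = mean + noise * rng.gauss(0.0, 1.0)
        A_inv = sherman_morrison(A_inv, x)  # A <- A + x x^T
        b = [bi + reward * xi for bi, xi in zip(b, x)]
        regret += best - mean
    return regret
```

Calling `run_linucb([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], horizon=100)` simulates 100 rounds on a two-armed instance. In a theoretical analysis `alpha` would be set from the noise level and confidence parameter rather than fixed at 1.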