In this paper, we study supervised pre-training of transformers for a class of sequential decision-making problems. This class is a subset of the general reinforcement learning formulation in that there is no transition probability matrix; though seemingly restrictive, it covers bandits, dynamic pricing, and newsvendor problems as special cases. This structure makes the optimal actions/decisions available in the pre-training phase, which in turn yields new insights into the training and generalization of the pre-trained transformer. We first note that training the transformer can be viewed as a performative prediction problem, and that existing methods and theories largely ignore, or cannot resolve, a resulting out-of-distribution issue. We propose a natural solution that includes transformer-generated action sequences in the training procedure, and show that it enjoys better properties both numerically and theoretically. The availability of optimal actions in the considered tasks also allows us to analyze the pre-trained transformer as an algorithm, explaining why it may lack exploration and how this can be automatically resolved. Numerically, we categorize the advantages of the pre-trained transformer over structured algorithms such as UCB and Thompson sampling into three cases: (i) it better utilizes the prior knowledge encoded in the pre-training data; (ii) it gracefully handles the misspecification issues that structured algorithms suffer from; (iii) for short time horizons such as $T\le 50$, it behaves more greedily and achieves much lower regret than structured algorithms designed for asymptotic optimality.
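To make the proposed training procedure concrete, the following is a minimal sketch of how transformer-generated action sequences could be mixed into the pre-training data for a Bernoulli bandit prior. All names here (`model_policy`, `mix`, the uniform fallback policy) are illustrative assumptions, not the paper's actual implementation; the key point is that each training example pairs a trajectory context with the known optimal action as the supervision label.

```python
import random

def sample_bandit_instance(num_arms, rng):
    # Draw arm means from a uniform prior; the optimal arm is known at
    # pre-training time, which is what enables supervised labels.
    means = [rng.random() for _ in range(num_arms)]
    return means, max(range(num_arms), key=lambda a: means[a])

def rollout(means, policy, horizon, rng):
    # Generate one trajectory of (action, reward) pairs under `policy`.
    history = []
    for _ in range(horizon):
        a = policy(history)
        r = 1.0 if rng.random() < means[a] else 0.0
        history.append((a, r))
    return history

def build_pretraining_batch(model_policy, num_arms, horizon, n_tasks, mix, rng):
    """Mix trajectories rolled out by the current transformer policy with
    trajectories from a uniform behavior policy; labels are always the
    optimal action, addressing the out-of-distribution issue at decode time."""
    batch = []
    for _ in range(n_tasks):
        means, opt = sample_bandit_instance(num_arms, rng)
        # With probability `mix`, generate the context with the model itself.
        use_model = rng.random() < mix
        policy = model_policy if use_model else (lambda h: rng.randrange(num_arms))
        traj = rollout(means, policy, horizon, rng)
        batch.append((traj, opt))  # (context, optimal-action label)
    return batch

rng = random.Random(0)
greedy_stub = lambda h: 0  # stand-in for the transformer's decode-time policy
batch = build_pretraining_batch(greedy_stub, num_arms=3, horizon=5,
                                n_tasks=10, mix=0.5, rng=rng)
```

In this sketch, setting `mix=0` recovers purely off-policy pre-training data, which is exactly the regime where the distribution of contexts seen at training time diverges from the contexts the transformer itself generates at deployment.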