Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling (EBM) is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although these circumstances do not necessarily lead to the best-performing models. We demonstrate that our model performs well on simple language (the Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, which appears only after very long training.
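To make the notion of inference as gradient descent on an energy landscape concrete, the following is a minimal sketch rather than the NRGPT model itself. It assumes a toy quadratic energy over token states; the names `energy`, `energy_grad`, and `descend`, the matrix `W`, and the step-size choice are all hypothetical and introduced only for illustration.

```python
import numpy as np

def energy(x, W):
    # Toy quadratic energy (an assumption, not the paper's energy):
    # E(x) = 0.5 * sum_t x_t^T W x_t, with x of shape (tokens, dim).
    return 0.5 * np.einsum("td,de,te->", x, W, x)

def energy_grad(x, W):
    # Gradient of the toy energy with respect to x, valid for symmetric W.
    return x @ W

def descend(x, W, n_steps=20):
    # Step size 1 / lambda_max(W) guarantees a non-increasing energy
    # for this quadratic form.
    step = 1.0 / np.linalg.eigvalsh(W).max()
    energies = [float(energy(x, W))]
    for _ in range(n_steps):
        x = x - step * energy_grad(x, W)  # one gradient-descent update
        energies.append(float(energy(x, W)))
    return x, energies

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
W = A @ A.T / 8 + np.eye(8)    # symmetric positive definite
x0 = rng.normal(size=(4, 8))   # 4 token states of dimension 8
x_final, energies = descend(x0, W)
```

In this sketch each update moves every token state downhill, so the recorded `energies` list never increases from one step to the next. In the setting described above, the energy would be defined by the network rather than by a fixed matrix.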