Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting that unifies it with the energy-based model (EBM) framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain conditions this exploration becomes gradient descent, although such dynamics do not necessarily yield the best-performing models. We demonstrate that our model performs well on simple language (the Shakespeare dataset), algebraic ListOps tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, exhibiting it only after very long training.
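To make the central idea concrete, here is a minimal sketch (not the authors' implementation) of energy-based inference by gradient descent: a candidate next-token embedding is treated as a free variable and moved downhill on a learned energy landscape, mirroring the "exploration becomes gradient descent" claim. The energy network `EnergyNet`, its layer sizes, the pooled-context representation, and the step count are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class EnergyNet(nn.Module):
    """Scalar energy E(context, candidate); lower means more compatible."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 128), nn.SiLU(), nn.Linear(128, 1)
        )

    def forward(self, context: torch.Tensor, candidate: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([context, candidate], dim=-1)).squeeze(-1)


def infer_by_descent(energy: EnergyNet, context: torch.Tensor,
                     steps: int = 20, lr: float = 0.1) -> torch.Tensor:
    """Run the 'exploration' as plain gradient descent on the energy."""
    # The candidate token embedding is the free variable being optimized.
    x = torch.zeros(context.shape[-1], requires_grad=True)
    for _ in range(steps):
        e = energy(context, x)
        (grad,) = torch.autograd.grad(e, x)
        with torch.no_grad():
            x -= lr * grad  # one descent step on the energy landscape
    return x.detach()


if __name__ == "__main__":
    torch.manual_seed(0)
    net = EnergyNet(dim=64)
    ctx = torch.randn(64)  # stand-in for a pooled context representation
    emb = infer_by_descent(net, ctx)
    print("final energy:", net(ctx, emb).item())
```

In a full model, the resulting low-energy embedding would still have to be mapped back to a discrete token (e.g., by nearest neighbor in the embedding table); this sketch only illustrates the continuous descent dynamics themselves.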