In many domains, autoregressive models attain high likelihood on the task of predicting the next observation. However, this maximum-likelihood (MLE) objective does not necessarily match the downstream use case of autoregressively generating high-quality sequences. The MLE objective weights sequences in proportion to their frequency under the data distribution and gives no guidance for the model's behaviour out of distribution (OOD), which leads to compounding error during autoregressive generation. To address this compounding error problem, we formulate sequence generation as an imitation learning (IL) problem. This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and the distribution of sequences in a dataset, including divergences that place weight on OOD generated sequences. The IL framework also allows us to incorporate backtracking by introducing a backspace action into the generation process. This further mitigates the compounding error problem by allowing the model to revert a sampled token if it takes the sequence OOD. Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes. We identify the SequenceMatch-$\chi^2$ divergence as a more suitable training objective for autoregressive models that are used for generation. Empirically, we show that SequenceMatch training leads to improvements over MLE on text generation with language models and on arithmetic.
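
To make the backspace action concrete, the following is a minimal sketch of a generation loop in which one action deletes the most recently sampled token instead of appending a new one. It illustrates only the action semantics described above, not the SequenceMatch training objective. The names `BACKSPACE`, `apply_action`, `generate`, and the `policy` callable are hypothetical and chosen for illustration; how a trained model would actually expose such an action is an assumption here.

```python
from typing import Callable, List

# Hypothetical id reserved for the backspace action (an assumption for this
# sketch; a real vocabulary would assign it an ordinary token id).
BACKSPACE = -1


def apply_action(prefix: List[int], action: int) -> List[int]:
    """Return the new prefix after taking `action`.

    A normal token is appended; BACKSPACE removes the last token.
    Backspacing on an empty prefix leaves it empty.
    """
    if action == BACKSPACE:
        return prefix[:-1]
    return prefix + [action]


def generate(policy: Callable[[List[int]], int], max_steps: int, eos: int) -> List[int]:
    """Roll out `policy` for at most `max_steps` actions.

    `policy` maps the current prefix to the next action. Generation stops
    once the end-of-sequence token `eos` has been appended.
    """
    prefix: List[int] = []
    for _ in range(max_steps):
        action = policy(prefix)
        prefix = apply_action(prefix, action)
        if action == eos:
            break
    return prefix
```

For example, a policy that emits the actions `5, 9, BACKSPACE, 6, eos` produces the sequence `5, 6, eos`: the token `9` is reverted before generation continues. Note that `max_steps` bounds the number of actions, not the final sequence length, since backspaces consume steps without lengthening the output.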