Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder

The sequence-to-sequence (seq2seq) task aims to generate a target sequence from a given source sequence. Traditionally, most seq2seq tasks have been solved with the encoder-decoder framework, which uses an encoder to encode the source sequence and a decoder to generate the target text. Recently, a number of approaches have emerged that apply decoder-only language models directly to seq2seq tasks. Despite significant progress in this direction, a thorough analysis of the effectiveness of the decoder-only language model architecture is still lacking. This paper addresses that gap with a detailed comparison of the encoder-decoder architecture and the decoder-only language model framework, carried out through the analysis of a regularized encoder-decoder structure. This structure is designed to replicate all behaviors of the classical decoder-only language model, but it has an explicit encoder and decoder, which makes it easier to compare with the classical encoder-decoder structure. Based on this analysis, we unveil the attention degeneration problem in the language model: as the number of generation steps grows, less and less attention is focused on the source sequence. To give a quantitative understanding of this problem, we conduct a theoretical sensitivity analysis of the attention output with respect to the source input. Building on our analysis, we propose a novel partial attention language model to solve the attention degeneration problem. Experimental results on machine translation, summarization, and data-to-text generation tasks support our analysis and demonstrate the effectiveness of our proposed model.
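The attention degeneration effect described above can be illustrated with a toy simulation. The sketch below is not the paper's sensitivity analysis; it assumes, purely for illustration, a single attention query whose scores over the source tokens and the previously generated target tokens are i.i.d. Gaussian. Under that assumption the expected share of attention mass on the source is `source_len / (source_len + step)`, which shrinks as the generation step grows. The function names (`source_attention_share`, `mean_source_share`) and all parameter values are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def source_attention_share(source_len, step, rng):
    """Attention mass on the source for one query at a given generation step.

    The query attends to `source_len` source positions followed by `step`
    previously generated target positions, as in a decoder-only model that
    reads the concatenated source and target prefix.
    Assumption: scores are i.i.d. standard normal (a toy model only).
    """
    scores = [rng.gauss(0.0, 1.0) for _ in range(source_len + step)]
    weights = softmax(scores)
    return sum(weights[:source_len])


def mean_source_share(source_len, step, trials=2000, seed=0):
    """Average the source attention share over many random score draws."""
    rng = random.Random(seed)
    total = sum(source_attention_share(source_len, step, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    source_len = 20  # hypothetical source length
    for step in (0, 1, 10, 100):
        share = mean_source_share(source_len, step)
        print(f"step={step:4d}  mean source share={share:.3f}")
```

In this toy setting the source share falls simply because the source competes with an ever longer target prefix inside one softmax. Trained attention scores are not i.i.d., so the sketch only conveys the direction of the effect, not its magnitude in a real model.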
