Although dominant in natural language processing, transformer-based models remain challenged by long-sequence processing, because the computational cost of self-attention grows quadratically with the input sequence length. To alleviate this complexity, we propose a simple framework that enables off-the-shelf pre-trained transformers to process much longer sequences while the computation and memory costs grow only linearly with the input length. More specifically, our method divides each long-sequence input into a batch of chunks, aligns inter-chunk information during the encoding steps, and finally selects the most representative hidden states from the encoder for the decoding process. To extract inter-chunk semantic information, we align the start- and end-token embeddings among chunks in each encoder transformer block. To learn an effective hidden-state selection policy, we design a dual updating scheme inspired by reinforcement learning, which regards the transformer decoder as the environment and the downstream performance metric as the reward for evaluating hidden-selection actions. Our empirical results on real-world long-text summarization and reading comprehension tasks demonstrate effective improvements over prior long-sequence processing baselines.
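The chunk-then-select pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the chunk size, the hidden-state scores, and the helper names (`split_into_chunks`, `select_top_k`) are all assumptions introduced here for clarity. Since each chunk is encoded independently, the cost of self-attention is quadratic only within a fixed-size chunk, which makes the total cost linear in the number of chunks and hence in the input length.

```python
# Hypothetical sketch of chunking a long input and selecting the most
# representative hidden states for the decoder. Chunk size, scores, and k
# are illustrative assumptions, not values from the paper.

from typing import List


def split_into_chunks(token_ids: List[int], chunk_size: int) -> List[List[int]]:
    """Divide a long token sequence into fixed-size chunks
    (the last chunk may be shorter)."""
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]


def select_top_k(hidden_states: List[List[float]],
                 scores: List[float],
                 k: int) -> List[List[float]]:
    """Keep the k hidden states with the highest selection scores,
    preserving their original order for the decoder."""
    ranked = sorted(range(len(hidden_states)),
                    key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore sequence order
    return [hidden_states[i] for i in keep]


# Toy usage: a 10-token input split into chunks of 4.
tokens = list(range(10))
chunks = split_into_chunks(tokens, 4)  # [[0,1,2,3], [4,5,6,7], [8,9]]
```

In the paper's framing, the selection scores would come from a learned policy updated with the downstream metric as a reward; here they are placeholders to show the data flow.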