The quadratic complexity of self-attention in Transformers has hindered the processing of long text. To alleviate this problem, previous works have proposed sparsifying the attention matrix, taking advantage of the observation that crucial information about a token can be derived from its neighbors. These methods typically combine some form of local attention with global attention. Such combinations introduce abrupt changes in contextual granularity when going from local to global, which may be undesirable. We believe that a smoother transition could enhance the model's ability to capture long-context dependencies. In this study, we introduce Fovea Transformer, a long-context-focused Transformer that addresses the challenge of capturing global dependencies while maintaining computational efficiency. To achieve this, we construct a multi-scale tree from the input sequence and use representations of context tokens at progressively coarser granularity in the tree as their distance to the query token increases. We evaluate our model on three long-context summarization tasks\footnote{Our code is publicly available at: \textit{https://github.com/ZiweiHe/Fovea-Transformer}}. It achieves state-of-the-art performance on two of them and competitive results on the third, where some evaluation metrics improve and others regress.
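To make the multi-scale idea concrete, the following is a minimal sketch of one way such a tree and distance-dependent context selection could look. It is not the authors' implementation. The binary branching, the mean pooling, the power-of-two length restriction, the `radius` parameter, and the function names `build_tree` and `select_context` are all assumptions made for illustration.

```python
def build_tree(tokens):
    """Build a binary multi-scale tree over a sequence of feature vectors.

    Level 0 holds the tokens themselves; each higher level mean-pools
    adjacent pairs of the level below (pooling choice is an assumption).
    """
    n = len(tokens)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two sequence length")
    levels = [[list(t) for t in tokens]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([
            [(a + b) / 2.0 for a, b in zip(prev[j], prev[j + 1])]
            for j in range(0, len(prev), 2)
        ])
    return levels


def select_context(n, query, radius=1):
    """Pick (level, index) nodes that cover all n positions for one query.

    Starting from the root, a node is kept as-is when it lies more than
    `radius` nodes away from the query's ancestor at the same level;
    otherwise it is split into its two children. Nearby context therefore
    stays fine-grained and distant context becomes progressively coarser.
    """
    top = n.bit_length() - 1  # root level when n is a power of two
    selected = []
    stack = [(top, 0)]
    while stack:
        level, idx = stack.pop()
        far = abs(idx - (query >> level)) > radius
        if level == 0 or far:
            selected.append((level, idx))
        else:
            stack.append((level - 1, 2 * idx))
            stack.append((level - 1, 2 * idx + 1))
    # Sort by the first token position each node covers.
    return sorted(selected, key=lambda node: node[1] << node[0])


if __name__ == "__main__":
    levels = build_tree([[float(i)] for i in range(16)])
    nodes = select_context(16, query=0)
    # Context vectors the query at position 0 would attend to.
    context = [levels[level][idx] for level, idx in nodes]
    print(nodes)
    print(context)
```

In this sketch, the query at position 0 of a 16-token sequence attends to four individual tokens, then two nodes spanning two tokens each, then two nodes spanning four tokens each. Each level contributes at most a constant number of nodes for a fixed `radius`, so the context size grows with the tree depth rather than with the sequence length.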