Most state-of-the-art techniques for Language Models (LMs) today rely on Transformer-based architectures and their ubiquitous attention mechanism. However, the quadratic growth in computational cost with input sequence length confines Transformers to handling short passages. Recent efforts have aimed to address this limitation by introducing selective attention mechanisms, notably local and global attention. While sparse attention mechanisms have been theoretically established to be Turing-complete, like full attention, their practical impact on pre-training remains unexplored. This study empirically assesses the influence of global attention on BERT pre-training. We first build an extensive corpus of structure-aware text from arXiv data, alongside a text-only counterpart. We pre-train on these two datasets, investigate the resulting shifts in attention patterns, and assess their implications for downstream tasks. Our analysis underscores the significance of incorporating document structure into LMs, demonstrating their capacity to excel at more abstract tasks, such as document understanding.