Long documents often exhibit structure, with hierarchically organized elements that serve different functions, such as section headers and paragraphs. Despite the omnipresence of document structure, its role in natural language processing (NLP) remains opaque. Do long-document Transformer models acquire an internal representation of document structure during pre-training? How can structural information be communicated to a model after pre-training, and how does it influence downstream performance? To answer these questions, we develop a novel suite of probing tasks to assess the structure-awareness of long-document Transformers, propose general-purpose structure infusion methods, and evaluate the effects of structure infusion on QASPER and Evidence Inference, two challenging long-document NLP tasks. Results on LED and LongT5 suggest that both models acquire an implicit understanding of document structure during pre-training, which can be further enhanced by structure infusion, leading to improved end-task performance. To foster research on the role of document structure in NLP modeling, we make our data and code publicly available.