Transformer-based models have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet their ability to handle long documents is constrained by computational limitations. Traditional approaches, such as input truncation, sparse self-attention, and chunking, attempt to mitigate these issues, but they often lead to information loss and hinder the model's ability to capture long-range dependencies. In this paper, we introduce ChuLo, a novel chunk representation method for long document classification that addresses these limitations. ChuLo groups input tokens using unsupervised keyphrase extraction, emphasizing semantically important, keyphrase-based chunks to retain core document content while reducing input length. This approach minimizes information loss and improves the efficiency of Transformer-based models. Preserving all tokens is particularly important in long document understanding, especially for token classification tasks, to ensure that fine-grained annotations, which depend on the entire sequence context, are not lost. We evaluate our method on multiple long document classification and long document token classification tasks, demonstrating its effectiveness through comprehensive qualitative and quantitative analyses.