Transformer-based models have significantly advanced natural language processing, particularly in text classification tasks. Nevertheless, these models struggle to process large files, primarily because their input length is generally restricted to hundreds or thousands of tokens. Existing attempts to address this issue usually extract only a fraction of the essential information from lengthy inputs, and often incur high computational costs because of their complex architectures. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. LaFiCMIL is optimized to run efficiently on a single GPU and applies to binary, multi-class, and multi-label classification tasks. We conducted extensive experiments on seven diverse and comprehensive benchmark datasets to assess LaFiCMIL's effectiveness. With BERT as the feature extractor, LaFiCMIL demonstrates exceptional performance, setting new benchmarks across all datasets. Notably, our approach scales BERT to handle nearly 20,000 tokens while operating on a single GPU with 32 GB of memory. This efficiency, coupled with its state-of-the-art performance, highlights LaFiCMIL's potential as a groundbreaking approach to large file classification.
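To give a sense of scale for the multiple-instance framing, the sketch below splits one long token sequence into a "bag" of BERT-sized instances. It is an illustration under stated assumptions, not LaFiCMIL's actual pipeline: the abstract does not specify how files are segmented, so the fixed-size, non-overlapping chunking, the function name `split_into_bag`, and the placeholder token ids are all hypothetical. The only external fact used is that standard BERT checkpoints accept at most 512 positions, two of which are normally taken by the `[CLS]` and `[SEP]` tokens.

```python
BERT_MAX_POSITIONS = 512   # position limit of the standard BERT checkpoints
SPECIAL_TOKENS = 2         # [CLS] and [SEP] wrapped around each chunk
CHUNK_SIZE = BERT_MAX_POSITIONS - SPECIAL_TOKENS  # 510 content tokens per instance


def split_into_bag(token_ids, chunk_size=CHUNK_SIZE):
    """Split one long token sequence into a bag of fixed-size instances.

    Assumption: plain non-overlapping chunks; the last chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]


# Hypothetical file of 20,000 tokens (placeholder integers, not real vocabulary ids).
document = list(range(20_000))
bag = split_into_bag(document)

print(len(bag))      # number of instances in the bag
print(len(bag[-1]))  # length of the final, shorter instance
```

Under these assumptions a file of about 20,000 tokens becomes a bag of 40 instances, each of which fits within BERT's input limit; a multiple-instance model then has to predict one label (or label set) for the whole bag from the per-instance features.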