Transformer-based models have significantly advanced natural language processing, particularly in text classification. Nevertheless, these models struggle to process large files, primarily because their input length is generally restricted to hundreds or thousands of tokens. Existing attempts to address this issue usually extract only a fraction of the essential information from lengthy inputs, and their complex architectures often incur high computational costs. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. LaFiCMIL is optimized for efficient operation on a single GPU and applies to binary, multi-class, and multi-label classification tasks. We conducted extensive experiments on seven diverse and comprehensive benchmark datasets to assess LaFiCMIL's effectiveness. With BERT as the feature extractor, LaFiCMIL achieves new state-of-the-art results on all seven datasets. Notably, our approach scales BERT to handle nearly 20,000 tokens on a single GPU with 32GB of memory. This efficiency, coupled with its state-of-the-art performance, highlights LaFiCMIL's potential as a groundbreaking approach to large file classification.
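To make the multiple instance learning framing concrete, the sketch below shows one way a long token sequence could be turned into a "bag" of BERT-sized instances. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how files are split, so the non-overlapping 512-token windows, the helper name `split_into_bag`, and the special-token ids (taken from the `bert-base-uncased` vocabulary) are all assumptions made here.

```python
from typing import List

# Assumptions for illustration only (not stated in the abstract):
CLS_ID = 101   # [CLS] id in the bert-base-uncased vocabulary
SEP_ID = 102   # [SEP] id in the bert-base-uncased vocabulary
MAX_LEN = 512  # BERT's maximum input length, special tokens included


def split_into_bag(token_ids: List[int], max_len: int = MAX_LEN) -> List[List[int]]:
    """Split one long token sequence into a bag of BERT-sized instances.

    Each instance is a non-overlapping window of the input, wrapped in
    [CLS] ... [SEP] so that it can be encoded by BERT on its own.
    """
    body = max_len - 2  # reserve two slots for [CLS] and [SEP]
    return [
        [CLS_ID] + token_ids[start:start + body] + [SEP_ID]
        for start in range(0, len(token_ids), body)
    ]


# A dummy "file" of 20,000 token ids, roughly the scale quoted in the abstract.
bag = split_into_bag(list(range(1000, 21000)))
print(len(bag), len(bag[0]), len(bag[-1]))  # prints: 40 512 112
```

Under these assumptions, a 20,000-token file becomes a bag of 40 instances. Each instance would be encoded separately, and a bag-level aggregator would then combine the 40 instance embeddings into a single prediction for the file. That aggregation step is where a correlated multiple instance learning model would sit, and it is not sketched here.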