Transformer-based models such as BERT have revolutionized many language tasks, but they still struggle with large file classification because of their input limit (e.g., 512 tokens). Despite several attempts to alleviate this limitation, no method consistently excels across all benchmark datasets, primarily because existing methods extract only part of the essential information from the input file. They also fail to adapt to the varied properties of different types of large files. In this work, we tackle the problem from the perspective of correlated multiple instance learning. The proposed approach, LaFiCMIL, is a versatile framework applicable to binary, multi-class, and multi-label large file classification tasks across domains including Natural Language Processing, Programming Language Processing, and Android Analysis. To evaluate its effectiveness, we employ eight benchmark datasets covering Long Document Classification, Code Defect Detection, and Android Malware Detection. With BERT-family models as feature extractors, our experiments show that LaFiCMIL achieves new state-of-the-art performance on all benchmark datasets. This is largely attributable to its ability to scale BERT up to nearly 20K tokens while running on a single Tesla V100 GPU with 32 GB of memory.
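To make the multiple instance learning view concrete, the sketch below treats a long file as a bag of fixed-size token chunks, each small enough for a 512-token encoder. It is only an illustration under stated assumptions, not the LaFiCMIL implementation: the names `split_into_bag` and `mean_pool` are hypothetical, the 512-token chunk size ignores special tokens such as [CLS] and [SEP], and plain mean pooling stands in for the correlated aggregation that the abstract does not detail.

```python
from typing import List, Sequence


def split_into_bag(token_ids: Sequence[int], max_len: int = 512) -> List[List[int]]:
    """Split a long token sequence into consecutive chunks ("instances").

    Each chunk holds at most `max_len` tokens, so it fits an encoder with
    that input limit. The list of chunks forms one "bag" for the file.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(token_ids[i:i + max_len]) for i in range(0, len(token_ids), max_len)]


def mean_pool(instance_features: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate per-instance feature vectors into one bag-level vector.

    Assumption: simple averaging, used here only as a placeholder for a
    learned aggregator that models correlations between instances.
    """
    if not instance_features:
        raise ValueError("bag must contain at least one instance")
    n = len(instance_features)
    dim = len(instance_features[0])
    return [sum(vec[d] for vec in instance_features) / n for d in range(dim)]


if __name__ == "__main__":
    # A hypothetical file of 20,000 token ids.
    bag = split_into_bag(range(20_000))
    print(len(bag), len(bag[0]), len(bag[-1]))
```

Under these assumptions, a file of about 20K tokens becomes a bag of roughly 40 instances, which gives a sense of how many encoder passes a single file requires.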