Given the number of Arabic speakers worldwide and the large amount of content on the web today in fields such as law, medicine, and even news, documents of considerable length are produced regularly. Classifying these documents with traditional learning models is often impractical, because computational requirements grow to an unsustainable level as document length increases. Thus, it is necessary to customize these models specifically for long textual documents. In this paper, we propose two simple but effective models for classifying long Arabic documents. We also fine-tune two other models, namely Longformer and RoBERT, on the same task and compare their results to those of our models. Both of our models outperform Longformer and RoBERT on this task across two different datasets.