For large-scale IT corpora with hundreds of classes organized in a hierarchy, accurately classifying the higher-level classes is crucial, because errors made there propagate to the lower levels. In the business world, an efficient and explainable ML model is preferred over an expensive black-box model, especially if the performance gain is marginal. A current trend in the Natural Language Processing (NLP) community is to employ huge pre-trained language models (PLMs), that is, self-attention models such as BERT, for almost any kind of NLP task (e.g., question answering, sentiment analysis, text classification). Despite the widespread use of PLMs and their impressive performance across a broad range of NLP tasks, there is no clear, well-justified reason why these models should be employed for domain-specific text classification (TC) tasks: the specialized words (i.e., jargon) found in domain-specific text are largely monosemic, which defeats the purpose of contextualized embeddings (e.g., those produced by PLMs). In this paper, we compare the accuracies of several state-of-the-art (SOTA) models reported in the literature against a linear SVM classifier with TF-IDF vectorization on three TC datasets. The results show that the linear SVM performs comparably. These findings indicate that, for domain-specific TC tasks, a linear model can provide a comparable, cheap, reproducible, and interpretable alternative to attention-based models.
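To make the baseline concrete, the following is a minimal sketch of a TF-IDF plus linear SVM text classifier, written with the Python standard library only. It is not the implementation used in this paper: the toy documents, the class labels, the hyperparameters (`epochs`, `lr`, `lam`), and all function names are assumptions chosen for illustration, and the SVM is trained with a simple one-vs-rest subgradient descent on the hinge loss rather than a dedicated solver.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenisation is an illustrative simplification.
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed IDF weight for every term seen in the training docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def transform(doc, idf):
    """Map a document to an L2-normalised sparse TF-IDF vector (a dict)."""
    tf = Counter(t for t in tokenize(doc) if t in idf)  # unseen terms are dropped
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def score(x, w, b):
    return sum(w.get(t, 0.0) * v for t, v in x.items()) + b


def train_linear_svm(vectors, labels, epochs=50, lr=0.1, lam=0.01):
    """One-vs-rest linear SVM: subgradient descent on the L2-regularised hinge loss."""
    classes = sorted(set(labels))
    weights = {c: {} for c in classes}
    bias = {c: 0.0 for c in classes}
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            for c in classes:
                target = 1.0 if y == c else -1.0
                w = weights[c]
                margin = target * score(x, w, bias[c])
                for t in w:  # shrinkage step from the L2 penalty
                    w[t] *= 1.0 - lr * lam
                if margin < 1.0:  # hinge loss is active: move towards the example
                    for t, v in x.items():
                        w[t] = w.get(t, 0.0) + lr * target * v
                    bias[c] += lr * target
    return weights, bias


def predict(x, weights, bias):
    return max(weights, key=lambda c: score(x, weights[c], bias[c]))


def top_terms(weights, label, k=3):
    """Highest-weighted terms for one class: the model's own explanation."""
    return sorted(weights[label], key=weights[label].get, reverse=True)[:k]


# Hypothetical toy corpus of jargon-heavy IT tickets (not from the paper's datasets).
docs = [
    "kernel panic after driver update",
    "driver crash causes kernel panic on boot",
    "vpn tunnel drops packets on the router",
    "router firewall blocks vpn packets",
    "sql query timeout on database index",
    "database index rebuild slows sql query",
]
labels = ["os", "os", "network", "network", "database", "database"]

idf = fit_idf(docs)
weights, bias = train_linear_svm([transform(d, idf) for d in docs], labels)


def classify(text):
    return predict(transform(text, idf), weights, bias)


print(classify("kernel driver fault"))
print(top_terms(weights, "network"))
```

Because each class is scored by a single weight vector over vocabulary terms, the terms returned by `top_terms` can be read directly as the evidence the model relies on, which is the kind of interpretability the abstract contrasts with attention-based models.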