Natural Language Processing (NLP) for low-resource languages presents significant challenges, largely due to the scarcity of high-quality annotated data and linguistic resources. The choice of embeddings plays a critical role in the performance of NLP tasks such as news classification, sentiment analysis, and hate speech detection, especially for low-resource languages like Marathi. In this study, we investigate the impact of various embedding techniques (contextual BERT-based, non-contextual BERT-based, and FastText-based) on NLP classification tasks specific to the Marathi language. Our research includes a thorough evaluation of both compressed and uncompressed embeddings, providing a comprehensive overview of how these embeddings perform across different scenarios. Specifically, we compare embeddings from two BERT models, MuRIL and MahaBERT, and from two FastText models, IndicFT and MahaFT. Our evaluation applies each set of embeddings to a Multiple Logistic Regression (MLR) classifier to assess task performance, and uses t-SNE visualizations to observe the spatial distribution of these embeddings. The results demonstrate that contextual embeddings outperform non-contextual embeddings. Furthermore, BERT-based non-contextual embeddings extracted from the first BERT embedding layer yield better results than FastText-based embeddings, suggesting a potential alternative to FastText embeddings.
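As a minimal sketch of the non-contextual setting described above: static token vectors, whether looked up in BERT's first (embedding) layer or in a FastText table, are typically mean-pooled into a single sentence vector before being passed to a classifier such as MLR. The tiny embedding table and Marathi-like token names below are purely illustrative assumptions, not real model weights.

```python
# Illustrative static (non-contextual) embedding lookup with mean pooling.
# In the actual pipeline these vectors would come from a BERT embedding
# layer or a FastText model; here a hypothetical 3-d table stands in.
EMB = {
    "baatmi": [0.2, 0.1, 0.4],
    "chitrapat": [0.5, 0.3, 0.0],
    "kreeda": [0.1, 0.6, 0.2],
}

def sentence_embedding(tokens, table, dim=3):
    """Mean-pool the static vectors of known tokens into one sentence vector."""
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        # No known tokens: fall back to a zero vector of the right size.
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# One fixed-length vector per sentence, ready for a classifier's input.
sent_vec = sentence_embedding(["baatmi", "kreeda"], EMB)
```

Contextual embeddings differ in that the vector for each token depends on the full sentence (taken from a later BERT layer), which is what the study finds gives the stronger classification results.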