This paper introduces MaCMS, a new sentiment dataset for Magahi-Hindi-English (MHE) code-mixed language, where Magahi is a less-resourced minority language. It is the first Magahi-Hindi-English code-mixed dataset for sentiment analysis. We also provide a linguistic analysis of the dataset to understand the structure of its code-mixing, and a statistical study of the language preferences of speakers expressing different sentiment polarities. Alongside these analyses, we train baseline models to evaluate the dataset's quality.