This paper introduces MaCMS, a new sentiment dataset for Magahi-Hindi-English (MHE) code-mixed language, where Magahi is a less-resourced minority language. It is the first Magahi-Hindi-English code-mixed dataset for sentiment analysis tasks. We also provide a linguistic analysis of the dataset to understand the structure of code-mixing, and a statistical study of the language preferences of speakers expressing different sentiment polarities. Alongside these analyses, we train baseline models to evaluate the dataset's quality.