Research on code-mixed data is limited by the lack of dedicated code-mixed datasets and pre-trained language models. In this work, we focus on Marathi, a low-resource Indian language with no prior work on code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus of 5 million tweets for pretraining. We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus. Furthermore, for benchmarking, we present three supervised datasets, MeHate, MeSent, and MeLID, for the downstream tasks of code-mixed Mr-En hate speech detection, sentiment analysis, and language identification, respectively. Each of these evaluation datasets consists of approximately 12,000 manually annotated Marathi-English code-mixed tweets. Ablations show that the models trained on this novel corpus significantly outperform the existing state-of-the-art BERT models. This is the first work to present artifacts for code-mixed Marathi research. All datasets and models are publicly released at https://github.com/l3cube-pune/MarathiNLP.