Research on code-mixed data is limited by the lack of dedicated code-mixed datasets and pre-trained language models. In this work, we focus on the low-resource Indian language Marathi, for which there is no prior work on code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus of 10 million social media sentences for pre-training. We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus. Furthermore, for benchmarking, we present three supervised datasets, MeHate, MeSent, and MeLID, for the downstream tasks of code-mixed Mr-En hate speech detection, sentiment analysis, and language identification, respectively. Each of these evaluation datasets consists of approximately 12,000 manually annotated Marathi-English code-mixed tweets. Ablations show that the models trained on this novel corpus significantly outperform the existing state-of-the-art BERT models. This is the first work to present artifacts for code-mixed Marathi research. All datasets and models are publicly released at https://github.com/l3cube-pune/MarathiNLP.