My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

Research on code-mixed data is limited by the lack of dedicated code-mixed datasets and pre-trained language models. In this work, we focus on Marathi, a low-resource Indian language with no prior work in code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus of 5 million tweets for pretraining. We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus. For benchmarking, we present three supervised datasets, MeHate, MeSent, and MeLID, for the downstream tasks of code-mixed Mr-En hate speech detection, sentiment analysis, and language identification, respectively. Each of these evaluation datasets consists of ~12,000 manually annotated Marathi-English code-mixed tweets. Ablations show that models trained on this novel corpus significantly outperform existing state-of-the-art BERT models. This is the first work to present artifacts for code-mixed Marathi research. All datasets and models are publicly released at https://github.com/l3cube-pune/MarathiNLP.
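To make the idea of code-mixed text concrete, the following is a minimal sketch that tags each token of a tweet by writing script (Devanagari or Latin) using only the Python standard library. It is an illustration under stated assumptions, not the method behind MeLID or the released models: the function names `script_of` and `script_profile` are hypothetical, the example sentence is invented for illustration and is not taken from the datasets, and script is only a rough proxy for language, since Marathi written in Roman script would be tagged as Latin.

```python
import unicodedata
from collections import Counter


def script_of(token: str) -> str:
    """Tag a token as 'devanagari', 'latin', or 'other' by its characters.

    Hypothetical helper for illustration; it looks only at Unicode
    character names, so it detects script, not language.
    """
    dev = sum(unicodedata.name(ch, "").startswith("DEVANAGARI") for ch in token)
    lat = sum(unicodedata.name(ch, "").startswith("LATIN") for ch in token)
    if dev == 0 and lat == 0:
        return "other"  # digits, punctuation, emoji, other scripts
    return "devanagari" if dev >= lat else "latin"


def script_profile(text: str) -> Counter:
    """Count whitespace-separated tokens per script tag."""
    return Counter(script_of(tok) for tok in text.split())


if __name__ == "__main__":
    # Invented example sentence mixing Marathi (Devanagari) and English.
    tweet = "मला हा movie खूप आवडला"
    print([(tok, script_of(tok)) for tok in tweet.split()])
    print(script_profile(tweet))
```

A profile with both tags present suggests a script-mixed tweet. Token-level language identification over Romanized text needs a trained model and labeled data, which is the gap the supervised datasets described above are meant to address.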

