This work introduces the L3Cube-MahaSocialNER dataset, the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language. The dataset comprises 18,000 manually labeled sentences covering eight entity classes and addresses challenges posed by social media data, including non-standard language and informal idioms. Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on this dataset with both IOB and non-IOB notations. The results demonstrate the effectiveness of these models in accurately recognizing named entities in informal Marathi text. The L3Cube-MahaSocialNER dataset supports user-centric information extraction and real-time applications, providing a valuable resource for public opinion analysis, news, and marketing on social media platforms. We also show that the zero-shot results of a regular NER model are poor on the social NER test set, highlighting the need for more social NER datasets. The datasets and models are publicly available at https://github.com/l3cube-pune/MarathiNLP.
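To illustrate the difference between the IOB and non-IOB notations mentioned above, the following is a minimal Python sketch of converting a tag sequence between the two schemes. The label names `PER` and `LOC`, the outside tag `O`, and both function names are assumptions made for illustration; they are not taken from the dataset or its released code.

```python
from typing import List


def iob_to_non_iob(tags: List[str]) -> List[str]:
    """Drop the B-/I- prefix so each token carries only its entity class."""
    return [t.split("-", 1)[1] if t[:2] in ("B-", "I-") else t for t in tags]


def non_iob_to_iob(tags: List[str], outside: str = "O") -> List[str]:
    """Re-add prefixes, treating each run of identical labels as one entity."""
    out: List[str] = []
    prev = outside
    for t in tags:
        if t == outside:
            out.append(outside)
        elif t == prev:
            out.append("I-" + t)  # same class as previous token: continue the entity
        else:
            out.append("B-" + t)  # class changed: start a new entity
        prev = t
    return out


# Hypothetical tag sequence for a four-token sentence.
iob_tags = ["B-PER", "I-PER", "O", "B-LOC"]
flat_tags = iob_to_non_iob(iob_tags)
restored = non_iob_to_iob(flat_tags)
```

Note that the conversion is lossy in one direction: non-IOB tags cannot distinguish two adjacent entities of the same class from a single longer entity, so `non_iob_to_iob` merges them into one span.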