ANCHOLIK-NER is a Named Entity Recognition (NER) dataset for Bangla regional dialects, covering five regions: Sylhet, Chittagong, Barishal, Noakhali, and Mymensingh. It contains 17,405 sentences, 3,481 per region. The data was collected from two publicly available datasets and by web scraping online newspapers and articles. Annotation follows the BIO tagging scheme and was carried out by professional annotators with expertise in the regional dialects. The dataset is organized into a separate subset for each region and is distributed in CSV format. Each entry contains the text along with the named entities identified in it and their corresponding annotations. Named entities fall into ten classes: Person, Location, Organization, Food, Animal, Colour, Role, Relation, Object, and Miscellaneous. The dataset is a resource for developing and evaluating NER models on Bangla dialectal variation, and it can support regional language processing and low-resource NLP applications such as machine translation, information retrieval, and conversational AI.
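As an illustration of how BIO-tagged annotations are typically consumed, the following is a minimal sketch that groups tagged tokens into entity spans. The function name `bio_to_spans`, the short label names `PER` and `LOC`, and the example sentence are assumptions made for illustration; they are not taken from the dataset, whose actual column names and label strings should be checked in the CSV files.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_class) pairs.

    Assumes tags look like "B-PER", "I-PER", or "O" (hypothetical label names).
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the open entity of the same class.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity (ignored here):
            # close any open entity and reset.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Invented example sentence, not drawn from ANCHOLIK-NER.
example_tokens = ["Rahim", "Uddin", "visited", "Sylhet"]
example_tags = ["B-PER", "I-PER", "O", "B-LOC"]
example_spans = bio_to_spans(example_tokens, example_tags)
```

In this sketch an `I-` tag that does not continue an open entity of the same class is dropped instead of being promoted to a new entity; other conventions exist, and the choice affects span-level evaluation scores.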