Coreference resolution is a well-studied problem in NLP. Although it has been widely studied for English and other resource-rich languages, coreference resolution in Bengali remains largely unexplored due to the absence of relevant datasets. Bengali is a low-resource language and is morphologically richer than English. In this article, we introduce a new dataset, BenCoref, comprising coreference annotations for Bengali texts gathered from four distinct domains. This relatively small dataset contains 5,200 mention annotations forming 502 mention clusters within 48,569 tokens. We describe the process of creating this dataset and report the performance of multiple models trained on BenCoref. We expect that our work provides valuable insights into how coreference phenomena vary across several domains in Bengali and encourages the development of additional resources for Bengali. Furthermore, we found poor cross-lingual performance in the zero-shot setting from English, highlighting the need for more language-specific resources for this task.