Standard English and Malaysian English differ notably, which poses challenges for natural language processing (NLP) tasks on Malaysian English. Most existing datasets are based on standard English and are therefore inadequate for improving NLP tasks in Malaysian English. An experiment applying state-of-the-art Named Entity Recognition (NER) solutions to Malaysian English news articles shows that they cannot handle the morphosyntactic variations found in Malaysian English. To the best of our knowledge, no annotated dataset is available to improve these models. To address these issues, we constructed the Malaysian English News (MEN) dataset, which contains 200 news articles manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that a dataset tailor-made for Malaysian English can significantly improve NER performance on Malaysian English. This paper presents our data acquisition, our annotation methodology, and a thorough analysis of the annotated dataset. To validate annotation quality, we measured inter-annotator agreement and had a subject matter expert adjudicate the disagreements. The resulting dataset contains 6,061 entities and 3,268 relation instances. Finally, we discuss the spaCy fine-tuning setup and analyze the resulting NER performance. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction. The dataset and annotation guideline have been published on GitHub.
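The annotation-validation step mentioned above (inter-annotator agreement followed by adjudication) can be illustrated with a minimal sketch. The abstract does not state which agreement metric was used, so Cohen's kappa is an assumption here; the function names `cohen_kappa` and `disagreements` and the label strings are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Assumption: the paper may use a different agreement metric.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    all_labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[lab] * freq_b[lab] for lab in all_labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreements(labels_a, labels_b):
    """Indices where the annotators differ, i.e. items to send for adjudication."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Hypothetical token-level labels from two annotators.
annotator_1 = ["PERSON", "ORG", "O", "O"]
annotator_2 = ["PERSON", "O", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
to_adjudicate = disagreements(annotator_1, annotator_2)
```

In this sketch, items listed in `to_adjudicate` are the ones a subject matter expert would review; how the actual study aligned entity spans before comparing labels is not specified in the abstract.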