Named Entity Recognition (NER) is a fundamental task for extracting key information from text, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets improves performance notably on bar-wiki and moderately on bar-tweet. Conversely, training first on Bavarian yields slight gains on the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve state-of-the-art (SOTA) NER results on bar-wiki. Our results substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics for enhancing model performance.