Named entity recognition (NER) is one of the cornerstones of Danish NLP, essential for language technology applications within both industry and research. However, Danish NER is inhibited by a lack of available datasets. As a consequence, no current models are capable of fine-grained named entity recognition, nor have they been evaluated for potential generalizability issues across datasets and domains. To alleviate these limitations, this paper introduces: 1) DANSK, a named entity dataset that provides high-granularity tagging as well as within-domain evaluation of models across a diverse set of domains; 2) DaCy 2.6.0, which includes three generalizable models with fine-grained annotation; and 3) an evaluation of current state-of-the-art models' ability to generalize across domains. The evaluation of existing and new models revealed notable performance discrepancies across domains, which should be addressed within the field. Shortcomings in the annotation quality of the dataset and their impact on model training and evaluation are also discussed. Despite these limitations, we advocate for the use of the new dataset DANSK alongside further work on generalizability within Danish NER.