While extensively explored in text-based tasks, Named Entity Recognition (NER) remains largely neglected in spoken language understanding. Existing resources are limited to a single, English-only dataset. This paper addresses this gap by introducing MSNER, a freely available, multilingual speech corpus annotated with named entities. It adds named-entity annotations to the VoxPopuli dataset in four languages (Dutch, French, German, and Spanish). We also release an efficient annotation tool that leverages automatic pre-annotations for faster manual refinement. The resulting corpus comprises 590 hours and 15 hours of silver-annotated speech for training and validation, respectively, alongside a 17-hour, manually annotated evaluation set. We further provide an analysis comparing silver and gold annotations. Finally, we present baseline NER models to stimulate further research on this newly available dataset.