Despite the success of the Universal Dependencies (UD) project, exemplified by its impressive language breadth, there is still a lack of "within-language breadth": most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, no treebank yet exists for Bavarian, one of its language varieties, spoken by over 10 million people. To help close this gap, we present MaiBaam, the first multi-dialect Bavarian treebank, manually annotated with part-of-speech and syntactic dependency information in UD and covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely related Bavarian and German and showcase the rich variability of speakers' orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas, which span three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines, and code publicly available.