Despite their linguistic diversity and global significance, African languages remain underrepresented in NLP research and resources. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Built on the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker-verified data that capture key typological features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing, including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap: models still show clear limitations across all nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.