We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- variant of the Class 2 prefix wa-, arising from vowel coalescence (the merger of two adjacent vowels; 95.1% consistency), and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms yields 78.2% lemmatization accuracy, and a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves a 97.3% segmentation rate and an 86.7% lemmatization rate across all major word classes. The ensemble combines transfer learning from Swahili with unsupervised clustering via weighted voting and exploits their complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap), while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.
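As a rough illustration of how two predictors might be combined via weighted voting, the sketch below sums weight-scaled confidences per candidate label and returns the top-scoring one. The function name `weighted_vote`, the source names, the weights, and the confidence values are all hypothetical and chosen only for illustration; the abstract does not specify the actual weighting scheme, so this should not be read as the released implementation.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-source predictions by weighted voting.

    predictions: dict mapping source name -> (label, confidence in [0, 1])
    weights:     dict mapping source name -> non-negative weight
    Returns (best_label, share), where share is the winning label's
    fraction of the total weighted score, or (None, 0.0) if no votes.
    """
    scores = defaultdict(float)
    for source, (label, confidence) in predictions.items():
        # Each source contributes its weight scaled by its own confidence.
        scores[label] += weights[source] * confidence

    total = sum(scores.values())
    if total == 0:
        return None, 0.0

    best = max(scores, key=scores.get)
    return best, scores[best] / total


# Hypothetical example: the two components disagree on a noun class.
predictions = {
    "transfer": ("class_2", 0.9),    # assumed Swahili-transfer output
    "clustering": ("class_1", 0.6),  # assumed clustering output
}
weights = {"transfer": 0.6, "clustering": 0.4}  # illustrative weights only

label, share = weighted_vote(predictions, weights)
print(label, round(share, 3))
```

In a scheme like this, a word with a clear Swahili cognate would tend to be decided by the transfer component, while a high-confidence cluster assignment could still win when transfer is uncertain.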