Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data

We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer trained on Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates, each shared across five or more languages. We evaluate these candidates against two established historical resources: the Bantu Lexical Reconstructions database (BLR3; 4,786 reconstructed Proto-Bantu forms) and the ASJP basic vocabulary. Of the top 11 noun candidates, 10 (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU 'person' (8 languages), *gombe 'cow' (9 languages), and *mUn (9 languages). Among verbs, 12 cognates align with reconstructed Proto-Bantu roots, including *-bon- 'see' and *-jIm- 'stand', each attested across a wide geographic range. Cross-model validation with an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis shows that all 13 productive classes maintain cosine similarity above 0.83 across languages, with within-class similarity exceeding between-class similarity (p < 10^-9). Because our dataset is restricted to Eastern and Southern Bantu, we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu, rather than as definitively distinguishing Proto-Bantu retentions from later regional innovations.
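The within-class versus between-class comparison reported above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy inputs, and the choice of a label-permutation test are all assumptions made for illustration, since the abstract does not state which significance test was used. The sketch takes one embedding vector per lemma and a noun class label per lemma, compares the mean cosine similarity of same-class pairs with that of different-class pairs, and estimates a p-value by shuffling the labels.

```python
import numpy as np

def class_similarity_gap(embeddings, labels):
    """Return (mean within-class, mean between-class) cosine similarity.

    Assumes every embedding has non-zero norm and that both same-class
    and different-class pairs exist.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sims = X @ X.T                                     # pairwise cosine similarity
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    # Count each unordered pair once and skip self-pairs.
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    within = sims[same & upper].mean()
    between = sims[~same & upper].mean()
    return within, between

def permutation_p_value(embeddings, labels, n_perm=1000, seed=0):
    """One-sided permutation p-value for within-class > between-class.

    Hypothetical stand-in for whatever test the paper actually used.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    within, between = class_similarity_gap(embeddings, labels)
    observed = within - between
    count = 0
    for _ in range(n_perm):
        w, b = class_similarity_gap(embeddings, rng.permutation(labels))
        if w - b >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (count + 1) / (n_perm + 1)

# Toy data: two well-separated "noun classes" in a 2-d embedding space.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [1, 1, 2, 2]
within, between = class_similarity_gap(toy_embeddings, toy_labels)
p_value = permutation_p_value(toy_embeddings, toy_labels, n_perm=200)
```

With only four toy vectors the permutation p-value cannot become small, so the toy run shows the mechanics only; a p-value on the order reported above would require a correspondingly large number of lemmas and permutations, or an analytic test.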
