Acquiring factual knowledge for language models (LMs) in low-resource languages poses a serious challenge, so these languages typically rely on cross-lingual transfer in multilingual LMs (ML-LMs). In this study, we ask how ML-LMs acquire and represent factual knowledge. Using the multilingual factual knowledge probing dataset mLAMA, we first conducted a neuron-level investigation of ML-LMs (specifically, multilingual BERT). We then traced facts back to their knowledge source (Wikipedia) to identify how ML-LMs acquire specific facts. Finally, we identified three patterns of acquiring and representing facts in ML-LMs (language-independent, cross-lingual shared, and transferred) and devised methods for differentiating them. Our findings highlight the challenge of maintaining consistent factual knowledge across languages, underscoring the need for better fact representation learning in ML-LMs.