Pre-trained language models, most notably BERT (Bidirectional Encoder Representations from Transformers), have revolutionized natural language understanding. However, a significant challenge remains for low-resource languages, where limited data hinders the effective training of such models. This work presents a novel approach to bridging this gap: transferring BERT capabilities from high-resource to low-resource languages using vocabulary matching. We conduct experiments on the Silesian and Kashubian languages and show that our approach improves the performance of BERT models even when the target language has minimal training data. Our results highlight the potential of the proposed technique for effectively training BERT models for low-resource languages, thus democratizing access to advanced language understanding models.