Large pre-trained models have revolutionized natural language processing (NLP) research and applications, but high training costs and limited data resources have prevented their benefits from being shared equally among speakers of all the world's languages. To improve cross-linguistic access to such models and to reduce the energy consumed by large-scale model training, this study proposes GreenPLM, an effective and energy-efficient framework that uses bilingual lexicons to directly "translate" pre-trained language models from one language into another at almost no additional cost. We validate this approach on BERT models for 18 languages and show that the framework is comparable to, if not better than, other heuristics with high training costs. In addition, given lightweight continued pre-training on limited data where available, the framework outperforms the original monolingual language models in six out of seven tested languages with up to 200x less pre-training effort. In line with the Leave No One Behind (LNOB) principle, our approach substantially reduces both inequalities between languages and energy consumption. We make our code and models publicly available here: \url{https://github.com/qcznlp/GreenPLMs}.
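As a rough illustration of what "translating" a model with a bilingual lexicon could look like, the sketch below rewrites a toy vocabulary so that each source-language token is replaced by its lexicon translation while keeping its position, and therefore its embedding row, unchanged. This is a minimal sketch under our own assumptions, not the released GreenPLM implementation: the function name `translate_vocab`, the special-token set, the fallback rules, and the toy English-German lexicon are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed BERT-style special tokens that should never be translated.
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}


def translate_vocab(
    source_vocab: List[str], lexicon: Dict[str, str]
) -> Tuple[List[str], int]:
    """Map a source-language vocabulary onto a target language.

    The token at index i of the result reuses embedding row i of the
    source model, so no model weights are modified (hypothetical sketch).
    Returns the new vocabulary and the number of translated entries.
    """
    target_vocab: List[str] = []
    used = set()
    n_translated = 0
    for index, token in enumerate(source_vocab):
        # Special tokens are kept; other tokens are looked up in the lexicon.
        translation = None if token in SPECIAL_TOKENS else lexicon.get(token)
        if translation is not None and translation not in used:
            new_token = translation
            n_translated += 1
        elif token not in used:
            # No usable translation: keep the source token as-is.
            new_token = token
        else:
            # Assumed fallback so that vocabulary entries stay unique.
            new_token = f"[unused{index}]"
        used.add(new_token)
        target_vocab.append(new_token)
    return target_vocab, n_translated


# Toy example: a six-entry vocabulary and a three-entry lexicon.
source_vocab = ["[PAD]", "[UNK]", "dog", "cat", "house", "##s"]
lexicon = {"dog": "Hund", "cat": "Katze", "house": "Haus"}
target_vocab, n_translated = translate_vocab(source_vocab, lexicon)
print(target_vocab, n_translated)
```

Because only the vocabulary changes in this sketch, the cost is a single pass over the vocabulary, which is consistent with the "almost no additional cost" claim; any continued pre-training on target-language data would be a separate step.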