Large language models (LLMs) show remarkable human-like capability across a wide range of domains and languages. However, a notable quality gap arises in low-resource languages, e.g., the indigenous languages of Indonesia, rendering LLMs ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol's effectiveness across a diverse array of tasks, attaining a 20% improvement, and demonstrate its capability to generalize to unseen tasks and to indigenous languages of Indonesia. Furthermore, Cendol models achieve improved human favorability despite their limitations in capturing indigenous knowledge and cultural values of Indonesia. In addition, we discuss the shortcomings of parameter-efficient tuning methods, such as LoRA, for language adaptation, and instead propose vocabulary adaptation to enhance efficiency. Lastly, we evaluate the safety of Cendol and show that safety acquired during pre-training in one language, such as English, is transferable to low-resource languages, such as Indonesian, even without RLHF or safety fine-tuning.