Large language models (LLMs) show remarkable human-like capability across various domains and languages. However, a notable quality gap arises in low-resource languages, e.g., Indonesian indigenous languages, rendering LLMs ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol's effectiveness across a diverse array of tasks, attaining a 20% improvement, and demonstrate its capability to generalize to unseen tasks and indigenous languages of Indonesia. Furthermore, Cendol models show improved human favorability despite their limitations in capturing indigenous knowledge and cultural values in Indonesia. In addition, we discuss the shortcomings of parameter-efficient tuning methods, such as LoRA, for language adaptation. As an alternative, we propose using vocabulary adaptation to enhance efficiency. Lastly, we evaluate the safety of Cendol and show that safety acquired during pre-training in one language, such as English, transfers to low-resource languages, such as Indonesian, even without RLHF and safety fine-tuning.