Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

Source: arXiv. Cendol models are released under the Apache 2.0 license and will be made publicly available soon.

Large language models (LLMs) show remarkable human-like capability in various domains and languages. However, a notable quality gap arises in low-resource languages, e.g., Indonesian indigenous languages, rendering LLMs ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol's effectiveness across a diverse array of tasks, attaining a 20% improvement, and demonstrate its capability to generalize to unseen tasks and indigenous languages of Indonesia. Furthermore, Cendol models show improved human favorability despite their limitations in capturing indigenous knowledge and cultural values in Indonesia. In addition, we discuss the shortcomings of parameter-efficient tuning methods, such as LoRA, for language adaptation. As an alternative, we propose using vocabulary adaptation to enhance efficiency. Lastly, we evaluate the safety of Cendol and show that safety acquired during pre-training in one language, such as English, is transferable to low-resource languages, such as Indonesian, even without RLHF and safety fine-tuning.

