Large language models (LLMs) show remarkable human-like capability across a wide range of domains and languages. However, a notable quality gap arises in low-resource languages, e.g., the indigenous languages of Indonesia, rendering LLMs ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol's effectiveness across a diverse array of tasks, attaining a 20% improvement, and demonstrate its capability to generalize to unseen tasks and to indigenous languages of Indonesia. Furthermore, Cendol models achieve improved human favorability despite their limitations in capturing indigenous knowledge and cultural values of Indonesia. In addition, we discuss the shortcomings of parameter-efficient tuning methods, such as LoRA, for language adaptation, and instead propose vocabulary adaptation to enhance efficiency. Lastly, we evaluate the safety of Cendol and show that safety acquired during pre-training in one language, such as English, is transferable to low-resource languages, such as Indonesian, even without RLHF or safety fine-tuning.