Komodo: A Linguistic Expedition into Indonesia's Regional Languages

Recent breakthroughs in Large Language Models (LLMs) have mostly focused on languages with abundant, easily available resources, such as English. However, there remains a significant gap for languages that lack sufficient linguistic resources in the public domain. Our work introduces Komodo-7B, a family of 7-billion-parameter Large Language Models designed to address this gap by operating seamlessly across Indonesian, English, and 11 regional languages of Indonesia. The family consists of Komodo-7B-Base and Komodo-7B-Instruct. Komodo-7B-Instruct achieves state-of-the-art performance across various tasks and languages, outperforming OpenAI's GPT-3.5, Cohere's Aya-101, Llama-2-Chat-13B, Mixtral-8x7B-Instruct-v0.1, Gemma-7B-it, and many more. The model demonstrates superior performance in both language-specific and overall assessments, and it shows that it can excel across linguistically diverse settings. Our commitment to advancing language models extends beyond well-resourced languages, aiming to bridge the gap for those with limited linguistic assets. Additionally, Komodo-7B-Instruct's stronger cross-language understanding helps address educational disparities in Indonesia by offering direct translation from English to 11 regional languages, a significant improvement over existing language translation services. Komodo-7B represents a crucial step towards inclusivity and effectiveness in language models, catering to the linguistic needs of diverse communities.
