Recent breakthroughs in Large Language Models (LLMs) have mostly focused on languages with abundant, readily available resources, such as English. A significant gap remains for languages that lack sufficient linguistic resources in the public domain. Our work introduces Komodo-7B, a family of 7-billion-parameter LLMs designed to address this gap by operating seamlessly across Indonesian, English, and 11 regional languages of Indonesia. The family consists of Komodo-7B-Base and Komodo-7B-Instruct. Komodo-7B-Instruct achieves state-of-the-art performance across various tasks and languages, outperforming OpenAI's GPT-3.5, Cohere's Aya-101, Llama-2-Chat-13B, Mixtral-8x7B-Instruct-v0.1, Gemma-7B-it, and many more. The model demonstrates superior performance in both language-specific and overall assessments, which highlights its ability to handle linguistic diversity. Our commitment to advancing language models extends beyond well-resourced languages and aims to bridge the gap for those with limited linguistic assets. Additionally, Komodo-7B-Instruct's stronger cross-language understanding helps address educational disparities in Indonesia by offering direct translation from English to 11 regional languages, a significant improvement over existing translation services. Komodo-7B represents a crucial step towards inclusivity and effectiveness in language models, catering to the linguistic needs of diverse communities.