Recent breakthroughs in Large Language Models (LLMs) have mostly focused on languages with abundant, easily available resources, such as English. A significant gap remains for languages that lack sufficient linguistic resources in the public domain. Our work introduces Komodo-7B, a family of 7-billion-parameter LLMs designed to address this gap by operating seamlessly across Indonesian, English, and 11 regional languages of Indonesia. The family consists of Komodo-7B-Base and Komodo-7B-Instruct. Komodo-7B-Instruct achieves state-of-the-art performance across various tasks and languages, outperforming OpenAI's GPT-3.5, Cohere's Aya-101, Llama-2-Chat-13B, Mixtral-8x7B-Instruct-v0.1, Gemma-7B-it, and many more. The model demonstrates superior performance in both language-specific and overall assessments, highlighting its ability to handle linguistic diversity. Our commitment to advancing language models extends beyond well-resourced languages, aiming to bridge the gap for those with limited linguistic resources. Additionally, Komodo-7B-Instruct's stronger cross-language understanding contributes to addressing educational disparities in Indonesia by offering direct translation from English to the 11 regional languages, a significant improvement over existing translation services. Komodo-7B represents a crucial step towards inclusivity and effectiveness in language models, catering to the linguistic needs of diverse communities.