Building LLMs for languages other than English is in great demand due to the limited availability of multilingual LLMs and their weak performance on tasks such as understanding the local context. The problem is especially critical for low-resource languages, which lack the instruction sets needed for fine-tuning. In a multilingual country like India, LLMs supporting Indic languages are needed to deliver generative AI and LLM-based technologies and services to its citizens. This paper presents our approach to i) generating a large Odia instruction set, including domain-knowledge data suitable for LLM fine-tuning, and ii) building a Llama2 fine-tuned model tailored for enhanced performance in the Odia domain. The proposed work will help researchers build instruction sets and LLMs, particularly for Indic languages. We will release the model and instruction set publicly for research and non-commercial purposes.