In the rapidly advancing field of AI and NLP, generative large language models (LLMs) stand at the forefront of innovation, showcasing unparalleled abilities in text understanding and generation. However, the limited representation of low-resource languages like Ukrainian poses a notable challenge, restricting the reach and relevance of this technology. Our paper addresses this by fine-tuning the open-source Gemma and Mistral LLMs on Ukrainian datasets, aiming to improve their linguistic proficiency and benchmarking them against other existing models capable of processing the Ukrainian language. This endeavor not only mitigates language bias in technology but also promotes inclusivity in the digital realm. Our transparent and reproducible approach encourages further NLP research and development. Additionally, we present the Ukrainian Knowledge and Instruction Dataset (UKID) to aid future efforts in language model fine-tuning. Our research not only advances the field of NLP but also highlights the importance of linguistic diversity in AI, which is crucial for cultural preservation, education, and expanding AI's global utility. Ultimately, we advocate for a future where technology is inclusive, enabling AI to communicate effectively across all languages, especially those currently underrepresented.