Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in the region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritize safety and reliability by addressing both general and culture-specific considerations and incorporating mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.