Large language models (LLMs) have proven to be effective tools for a wide range of natural language processing (NLP) applications. Although many LLMs are multilingual, most remain English-centric and perform poorly on low-resource languages. Several Southeast Asia-focused LLMs have recently been developed, but none are truly open source, as their training data is not publicly disclosed. Truly open-source models are important for transparency and for enabling a deeper, more precise understanding of LLM internals and development, including biases, generalization, and multilinguality. Motivated by recent advances demonstrating that parallel data improves multilingual performance, we conduct controlled and comprehensive experiments on the role of parallel data in the continual pretraining of LLMs. Our findings show that continual pretraining on parallel data alone is the most effective way to extend an LLM to new languages. Using just 34.7B tokens of parallel data and 180 hours on 8x NVIDIA H200 GPUs, we build OpenSeal, the first truly open Southeast Asian LLM, which rivals the performance of existing models of similar size.
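To make the described setup concrete, the sketch below illustrates continual pretraining of a causal LM on parallel (translation-pair) data. It is a minimal illustration, not the paper's recipe: the base checkpoint, the English-Indonesian pair template, the toy corpus, and all hyperparameters are assumptions chosen for the example, and the Hugging Face Trainer is used only as a convenient stand-in for the actual training stack.

```python
# Minimal sketch: continual pretraining on parallel data with next-token prediction.
# All names and settings below are illustrative assumptions, not the paper's configuration.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "EleutherAI/pythia-70m"  # small illustrative base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Toy parallel corpus: English-Indonesian sentence pairs (illustrative only).
pairs = [
    {"src": "The weather is nice today.", "tgt": "Cuacanya bagus hari ini."},
    {"src": "I am reading a book.", "tgt": "Saya sedang membaca buku."},
]

def to_example(pair):
    # Concatenate each pair into a single training sequence; the template is an assumption.
    text = f"English: {pair['src']}\nIndonesian: {pair['tgt']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=512)

dataset = Dataset.from_list(pairs).map(to_example, remove_columns=["src", "tgt"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="openseal-cpt-sketch",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=dataset,
    # Causal-LM collator: labels are the input ids shifted for next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In this framing, parallel data is simply serialized as bilingual text and trained with the standard language-modeling objective; scaling the same idea to 34.7B tokens on multi-GPU hardware is what the paper's actual experiments involve.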