Large language models (LLMs) have significantly advanced natural language processing, but these advances have not been shared equally across languages. Most LLMs are trained on high-resource languages like English, and multilingual models generally underperform their monolingual counterparts. Moreover, aspects of their multilingual foundations, such as high computational demands and restrictive licensing regimes, sometimes constrain the byproducts they produce. In this study, we document the development of open-foundation models tailored for use in low-resource settings, along with their limitations and benefits. The result is the TeenyTinyLlama pair: two compact models for Brazilian Portuguese text generation. We release them under the permissive Apache 2.0 license on GitHub and Hugging Face for community use and further development. See https://github.com/Nkluge-correa/TeenyTinyLlama
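For readers who want to try the released checkpoints, below is a minimal sketch of loading one of them with the Hugging Face `transformers` library. The repository ID `nicholasKluge/TeenyTinyLlama-160m` and the generation settings are assumptions for illustration; the exact model IDs are listed in the linked GitHub repository.

```python
# Minimal sketch: load a TeenyTinyLlama checkpoint from Hugging Face
# and generate a short Brazilian Portuguese continuation.
# NOTE: the repo ID below is an assumption; check the project page
# (https://github.com/Nkluge-correa/TeenyTinyLlama) for the exact IDs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nicholasKluge/TeenyTinyLlama-160m"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Encode a Portuguese prompt and sample a short continuation.
inputs = tokenizer("A capital do Brasil é", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```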