Natural language tasks have seen significant progress, largely owing to the emergence of powerful large language models (LLMs). Pre-trained on extensive and diverse corpora, these models have become increasingly capable of capturing the intricacies of language. Although LLMs abound for many high-resource languages, few are available for European Portuguese (PT-PT). We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate the model's language modeling capabilities, we introduce CALAME-PT (Context-Aware LAnguage Modeling Evaluation for Portuguese), the first Portuguese zero-shot language-modeling benchmark. Our evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling and that it can generate sound, knowledge-rich, and coherent PT-PT text. The model also shows strong potential for various downstream tasks.