In this paper we present PeLLE, a family of large language models for Brazilian Portuguese based on the RoBERTa architecture and trained on curated, open data from the Carolina corpus. Aiming at reproducible results, we describe the pretraining of the models in detail. We also evaluate PeLLE models against a set of existing multilingual and PT-BR-refined pretrained Transformer-based LLM encoders, contrasting the performance of large versus smaller-but-curated pretrained models on several downstream tasks. We conclude that several tasks perform better with larger models, but some tasks benefit from smaller-but-curated data in pretraining.