Recent advances in pre-trained language models have led to many non-English versions, most of them encoder-only or decoder-only architectures. Spanish versions of BERT, RoBERTa, and GPT perform well on natural language understanding and generation, but encoder-decoder models designed for sequence-to-sequence tasks, which map an input text to an output text, remain scarce. This paper introduces and evaluates well-known encoder-decoder architectures pre-trained exclusively on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and assess them on a range of sequence-to-sequence tasks covering summarization, rephrasing, and generative question answering. All models perform competitively, with BART and T5 emerging as the top performers across all evaluated tasks. As an additional contribution, we have made all models publicly available to the research community to support future work in Spanish language processing.