Leveraging research on the neural modelling of Portuguese, we contribute a collection of datasets for an array of language processing tasks, together with a corresponding collection of neural language models fine-tuned on these downstream tasks. To align with mainstream benchmarks in the literature, originally developed in English, and to kick-start their Portuguese counterparts, the datasets were machine-translated from English with a state-of-the-art translation engine. The resulting PORTULAN ExtraGLUE benchmark provides a basis for research on Portuguese that can be improved in future work. Similarly, the respective fine-tuned neural language models, developed with a low-rank adaptation approach, are made available as baselines that can stimulate future work on the neural processing of Portuguese. All datasets and models have been developed and are made available for two variants of Portuguese: European and Brazilian.