This paper presents an approach for adapting the DeBERTaV3 XSmall model, pre-trained on English, to Brazilian Portuguese natural language processing (NLP) tasks. A key aspect of the methodology is a multistep training process that tunes the model effectively for Portuguese. The initial datasets, drawn from Carolina and BrWaC, are preprocessed to handle emojis, HTML tags, and encoding problems. A Portuguese-specific vocabulary of 50,000 tokens is built with SentencePiece. Because training from scratch is expensive, the weights of the pre-trained English model are used to initialize most of the network, and only the embeddings are initialized randomly. The model is then fine-tuned on the replaced token detection task, in the same format as DeBERTaV3 training. The adapted model, called DeBERTinha, is effective on downstream tasks such as named entity recognition, sentiment analysis, and sentence relatedness, outperforming BERTimbau-Large on two tasks despite having only 40M parameters.
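To make the preprocessing step concrete, the following is a minimal sketch of the kind of cleaning the abstract describes (HTML tags, emojis, encoding irregularities). The function name `clean_text`, the emoji code point ranges, and the choice of NFC normalization are illustrative assumptions; the abstract does not specify the exact rules applied to Carolina and BrWaC.

```python
import html
import re
import unicodedata

# Hypothetical patterns for illustration; the actual cleaning rules are not given in the abstract.
TAG_RE = re.compile(r"<[^>]+>")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector-16 and zero-width joiner
    "]+"
)
WS_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, normalize Unicode, and drop emojis."""
    text = TAG_RE.sub(" ", raw)                # remove markup such as <p> or <br/>
    text = html.unescape(text)                 # decode entities, e.g. &aacute; -> á
    text = unicodedata.normalize("NFC", text)  # compose accents into single code points
    text = EMOJI_RE.sub(" ", text)             # drop emojis in the assumed ranges
    return WS_RE.sub(" ", text).strip()        # collapse runs of whitespace


if __name__ == "__main__":
    print(clean_text("<p>Ol&aacute;, mundo! \U0001F600</p>"))  # Olá, mundo!
```

Tags are stripped before entities are decoded so that an escaped comparison such as `a &lt; b` is not mistaken for markup. A real pipeline would likely also need mojibake repair and language filtering, which this sketch leaves out.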