This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese, released in three versions (ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large) that are pre-trained on a large-scale corpus of high-quality, diverse Vietnamese text using the DeBERTa architecture. Although many successful Transformer-based pre-trained language models have been proposed for English, there are still few pre-trained models for Vietnamese, a low-resource language, that achieve good results on downstream tasks, especially question answering. We fine-tune and evaluate our models on three important natural language downstream tasks: part-of-speech tagging, named-entity recognition, and question answering. The empirical results demonstrate that ViDeBERTa, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese-specific natural language understanding tasks. Notably, ViDeBERTa_base has 86M parameters, only about 23% of the 370M parameters of PhoBERT_large, yet it matches or outperforms the previous state-of-the-art model. Our ViDeBERTa models are available at: https://github.com/HySonLab/ViDeBERTa.