High-quality text representations are crucial for natural language understanding (NLU), but low-resource languages such as Vietnamese face challenges due to limited annotated data. Although pre-trained models such as PhoBERT and CafeBERT perform well, their effectiveness is constrained by data scarcity. Contrastive learning (CL) has recently emerged as a promising approach for improving sentence representations, enabling models to distinguish between semantically similar and dissimilar sentences. We propose ViCLSR (Vietnamese Contrastive Learning for Sentence Representations), a novel supervised contrastive learning framework designed to optimize sentence embeddings for Vietnamese by leveraging existing natural language inference (NLI) datasets. We also propose a process for adapting existing Vietnamese datasets for supervised learning so that they are compatible with CL methods. Our experiments demonstrate that ViCLSR significantly outperforms the strong monolingual pre-trained model PhoBERT on five benchmark NLU datasets: ViNLI (+6.97% F1), ViWikiFC (+4.97% F1), ViFactCheck (+9.02% F1), UIT-ViCTSD (+5.36% F1), and ViMMRC2.0 (+4.33% Accuracy). These results show that supervised contrastive learning can effectively address resource limitations in Vietnamese NLU tasks and improve sentence representation learning for low-resource languages. Furthermore, we conduct an in-depth analysis of the experimental results to uncover the factors that contribute to the superior performance of contrastive learning models. ViCLSR is released for research purposes to support further work on natural language processing tasks.
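To make the idea of supervised contrastive learning on NLI data concrete, the following is a minimal sketch and not the ViCLSR implementation. It assumes a SimCSE-style setup in which each premise is paired with an entailment hypothesis (positive) and a contradiction hypothesis (hard negative), and an InfoNCE-style loss over cosine similarities. The function names `build_triplets` and `contrastive_loss`, the label strings, and the temperature of 0.05 are all illustrative assumptions; the abstract does not specify them.

```python
import math
from collections import defaultdict


def build_triplets(nli_rows):
    """Turn (premise, hypothesis, label) rows into (anchor, positive, negative).

    Assumption: labels are the strings "entailment" and "contradiction";
    a real dataset may use different label names or integer ids.
    """
    by_premise = defaultdict(dict)
    for premise, hypothesis, label in nli_rows:
        # Keep the first hypothesis seen for each label of a premise.
        by_premise[premise].setdefault(label, hypothesis)
    triplets = []
    for premise, hyps in by_premise.items():
        # Only premises with both a positive and a hard negative are usable.
        if "entailment" in hyps and "contradiction" in hyps:
            triplets.append((premise, hyps["entailment"], hyps["contradiction"]))
    return triplets


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(anchors, positives, negatives, temperature=0.05):
    """InfoNCE-style loss over a batch of sentence embeddings.

    For anchor i, the target is positives[i]; all other positives and all
    negatives in the batch act as distractors.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        logits += [cosine(anchor, n) / temperature for n in negatives]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


if __name__ == "__main__":
    # Toy, made-up rows; in practice these would come from an NLI dataset.
    rows = [
        ("p1", "h1", "entailment"),
        ("p1", "h2", "contradiction"),
        ("p1", "h3", "neutral"),
    ]
    print(build_triplets(rows))
    # Toy 2-d embeddings standing in for encoder outputs.
    print(contrastive_loss([[1.0, 0.0]], [[1.0, 0.1]], [[-1.0, 0.0]]))
```

In a real training loop the embeddings would come from a trainable encoder and the loss would be computed with an autodiff framework; the pure-Python version here only illustrates how the loss falls as anchors move toward their positives and away from their hard negatives.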