Lexical normalization, a fundamental task in Natural Language Processing (NLP), transforms words into their canonical forms and has been shown to benefit a range of downstream NLP tasks. In this work, we introduce Vietnamese Lexical Normalization (ViLexNorm), the first corpus developed for the Vietnamese lexical normalization task. The corpus comprises over 10,000 sentence pairs, sourced from public comments on Vietnam's most popular social media platforms and carefully annotated by human annotators. We evaluate the corpus with several methods; the best-performing system achieves an Error Reduction Rate (ERR) of 57.74% (van der Goot, 2019a) relative to the Leave-As-Is (LAI) baseline. In an extrinsic evaluation, applying the model trained on ViLexNorm shows that Vietnamese lexical normalization has a positive effect on other NLP tasks. Our corpus is publicly available for research purposes only.
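To make the reported metric concrete, the sketch below computes ERR in the way it is commonly defined: word-level accuracy of the system, normalized by the accuracy of the LAI baseline, which simply copies each input token unchanged. The function names and the toy tokens are illustrative assumptions; they are not taken from ViLexNorm or from the authors' evaluation code.

```python
def accuracy(predicted, gold):
    """Fraction of token positions where the prediction equals the gold form."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token-by-token")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def error_reduction_rate(raw, gold, predicted):
    """ERR = (acc_system - acc_LAI) / (1 - acc_LAI).

    The LAI (Leave-As-Is) baseline outputs the raw tokens unchanged, so its
    accuracy is the share of tokens that needed no normalization.
    """
    acc_system = accuracy(predicted, gold)
    acc_lai = accuracy(raw, gold)
    if acc_lai == 1.0:
        raise ValueError("ERR is undefined when no token needs normalization")
    return (acc_system - acc_lai) / (1.0 - acc_lai)


# Toy example (hypothetical, not drawn from the corpus).
raw = ["ko", "bik", "j", "luôn"]             # noisy social-media tokens
gold = ["không", "biết", "gì", "luôn"]       # canonical forms
predicted = ["không", "biết", "j", "luôn"]   # system misses one token

# LAI accuracy is 1/4 and system accuracy is 3/4, so ERR = 0.5 / 0.75.
print(f"ERR = {error_reduction_rate(raw, gold, predicted):.4f}")  # ERR = 0.6667
```

Under this definition, an ERR of 0 means the system does no better than leaving the text untouched, 1 means every token is normalized correctly, and a negative value means the system introduces more errors than it fixes.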