Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSWs) that hinder the effective operation of NLP tools. Current state-of-the-art methods for Vietnamese treat this issue as a lexical normalization problem, relying on manually created rules or multi-stage deep learning frameworks, which require extensive effort to craft intricate rules. In contrast, our approach is straightforward and employs only a sequence-to-sequence (Seq2Seq) model. We provide a dataset for textual normalization comprising 2,181 human-annotated comments with an inter-annotator agreement of 0.9014. Using the Seq2Seq model for textual normalization, we obtain an accuracy slightly below 70%. Nevertheless, textual normalization improves the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, demonstrating its potential to improve performance on complex NLP tasks. Our dataset is accessible for research purposes.