Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for Vietnamese treat this issue as a lexical normalization problem, relying on manually crafted rules or multi-stage deep learning frameworks, both of which require extensive effort to design intricate rules. In contrast, our approach is straightforward, employing a single sequence-to-sequence (Seq2Seq) model. In this work, we provide a dataset for textual normalization comprising 2,181 human-annotated comments with an inter-annotator agreement of 0.9014. Applying the Seq2Seq model to textual normalization, we achieve an accuracy slightly below 70%. Nevertheless, textual normalization improves the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, demonstrating its potential to enhance the performance of complex NLP tasks. Our dataset is available for research purposes.