Abuse in its various forms, including physical, psychological, verbal, sexual, financial, and cultural abuse, harms mental health. However, few studies have applied natural language processing (NLP) to this problem in Vietnam. We therefore contribute a human-annotated dataset for detecting abusive content in Vietnamese narrative texts. We sourced the texts from VnExpress, a popular Vietnamese online newspaper, where readers often share stories containing abusive content. Identifying and categorizing abusive spans in these texts posed significant challenges during dataset creation, and these challenges also motivated our research. To assess the difficulty of the dataset, we experimented with lightweight baseline models, with PhoBERT and XLM-RoBERTa frozen and their hidden states fed into a BiLSTM. In our experiments, PhoBERT outperforms the other models on both labeled and unlabeled abusive span detection. These results suggest that this approach has potential for further improvement.
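The abstract does not spell out how labeled and unlabeled span detection are scored. As a minimal sketch of one common reading, the code below assumes BIO-tagged tokens and exact-match span F1: a labeled match requires the same boundaries and the same abuse category, while an unlabeled match requires only the same boundaries. The tag names (`VERBAL`, `PHYSICAL`, `PSYCHOLOGICAL`) and the helper functions `bio_to_spans` and `span_f1` are hypothetical and are not taken from the dataset or the authors' code.

```python
from typing import List, Optional, Set, Tuple

# (start token index, end token index exclusive, label)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end, label) spans from a BIO tag sequence.

    Assumption: a span starts at "B-X" and continues over following "I-X"
    tags with the same label X. Stray "I-" tags are ignored.
    """
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: Set[Span], pred: Set[Span], labeled: bool = True) -> float:
    """Exact-match span F1; with labeled=False the category is ignored."""
    if not labeled:
        gold = {(s, e, "") for s, e, _ in gold}
        pred = {(s, e, "") for s, e, _ in pred}
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the first span has the right boundaries but the wrong category.
gold_tags = ["O", "B-VERBAL", "I-VERBAL", "O", "B-PHYSICAL"]
pred_tags = ["O", "B-PSYCHOLOGICAL", "I-PSYCHOLOGICAL", "O", "B-PHYSICAL"]
gold_spans = bio_to_spans(gold_tags)
pred_spans = bio_to_spans(pred_tags)
labeled_f1 = span_f1(gold_spans, pred_spans, labeled=True)     # 0.5
unlabeled_f1 = span_f1(gold_spans, pred_spans, labeled=False)  # 1.0
```

In this toy case the unlabeled score is higher than the labeled one because a span with correct boundaries but the wrong category still counts as found. Under this reading, the gap between the two scores separates boundary errors from categorization errors.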