Many under-resourced languages require high-quality datasets for specific tasks such as offensive language detection and disinformation or misinformation identification. However, the nature of the content may have a detrimental effect on annotators. This article revisits an approach to pseudo-labeling sensitive data, using the example of Ukrainian tweets covering the Russian-Ukrainian war. This acute topic is currently the focus of various language manipulations that produce a large amount of disinformation and profanity on social media platforms. The experiment highlights three main stages of data annotation and underlines the main obstacles encountered during machine annotation. Finally, we provide a basic statistical analysis of the obtained data and an evaluation of the models used for pseudo-labeling, and we set out guidelines on how researchers can leverage the corpus to carry out more advanced research and extend the existing data samples without annotators' engagement.