The detection of sensitive content in large datasets is crucial for ensuring that shared and analysed data is free from harmful material. However, current moderation tools, such as external APIs, suffer from limited customisation, inconsistent accuracy across diverse sensitive categories, and privacy concerns. Moreover, existing datasets and open-source models focus predominantly on toxic language, leaving gaps in the detection of other sensitive categories such as substance abuse or self-harm. In this paper, we present a unified dataset tailored for social media content moderation across six sensitive categories: conflictual language, profanity, sexually explicit material, drug-related content, self-harm, and spam. By collecting and annotating data with consistent retrieval strategies and guidelines, we address the shortcomings of previous, narrowly focused research. Our analysis demonstrates that fine-tuning large language models (LLMs) on this dataset yields significant improvements in detection performance over open off-the-shelf models such as LLaMA, and even over proprietary OpenAI models, which underperform the fine-tuned models by 10-15% overall. The gap is even wider for popular moderation APIs, which cannot be easily tailored to specific sensitive content categories.