As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively applied to social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to outperform traditional machine learning models in zero-shot settings, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments in reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing on real-world Bluesky posts, moderation decisions by the Bluesky Moderation Service, and annotations by two of the authors, we find considerable overlap between the sensitivity (81%--97%) and specificity (91%--100%) of the open-weight LLMs and those of the proprietary ones (72%--98% and 93%--99%, respectively). Our analysis further reveals that specificity exceeds sensitivity for rudeness detection, whereas the opposite holds for intolerance and threats. Lastly, we quantify inter-rater agreement among the human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show that open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
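For reference, the sensitivity and specificity figures above follow the standard confusion-matrix definitions; we state them here assuming harmful content is treated as the positive class:
\[
\text{sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{specificity} = \frac{TN}{TN + FP},
\]
where $TP$ and $FN$ count harmful posts the model correctly flags or misses, and $TN$ and $FP$ count non-harmful posts it correctly passes or wrongly flags.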