Large language models (LLMs) have exploded in popularity due to their ability to perform a wide array of natural language tasks. Text-based content moderation is one LLM use case that has recently attracted enthusiasm; however, there is little research investigating how LLMs perform in content moderation settings. In this work, we evaluate a suite of modern, commercial LLMs (GPT-3, GPT-3.5, GPT-4) on two common content moderation tasks: rule-based community moderation and toxic content detection. For rule-based community moderation, we construct 95 LLM moderation engines, each prompted with the rules of one of 95 Reddit subcommunities, and find that LLMs can be effective at rule-based moderation for many communities, achieving a median accuracy of 64% and a median precision of 83%. For toxicity detection, we find that LLMs significantly outperform existing commercially available toxicity classifiers. However, we also find that recent increases in model size add only marginal benefit to toxicity detection, suggesting a potential performance plateau for LLMs on toxicity detection tasks. We conclude by outlining avenues for future work in studying LLMs and content moderation.