Large language models (LLMs) have exploded in popularity due to their ability to perform a wide array of natural language tasks. Text-based content moderation is one LLM use case that has recently attracted enthusiasm; however, there is little research investigating how LLMs perform in content moderation settings. In this work, we evaluate a suite of commodity LLMs on two common content moderation tasks: rule-based community moderation and toxic content detection. For rule-based community moderation, we instantiate 95 subcommunity-specific LLMs by prompting GPT-3.5 with the rules of 95 Reddit subcommunities. We find that GPT-3.5 is effective at rule-based moderation for many communities, achieving a median accuracy of 64% and a median precision of 83%. For toxicity detection, we evaluate a suite of commodity LLMs (GPT-3, GPT-3.5, GPT-4, Gemini Pro, LLAMA 2) and show that LLMs significantly outperform currently widespread toxicity classifiers. However, recent increases in model size add only marginal benefit to toxicity detection, suggesting a potential performance plateau for LLMs on toxicity detection tasks. We conclude by outlining avenues for future work on LLMs and content moderation.