In this paper, we explore the feasibility of leveraging large language models (LLMs) to automate or otherwise assist human raters with identifying harmful content including hate speech, harassment, violent extremism, and election misinformation. Using a dataset of 50,000 comments, we demonstrate that LLMs can achieve 90% accuracy when compared to human verdicts. We explore how to best leverage these capabilities, proposing five design patterns that integrate LLMs with human rating, such as pre-filtering non-violative content, detecting potential errors in human rating, or surfacing critical context to support human rating. We outline how to support all of these design patterns using a single, optimized prompt. Beyond these synthetic experiments, we share how piloting our proposed techniques in a real-world review queue yielded a 41.5% improvement in optimizing available human rater capacity, and a 9--11% increase (absolute) in precision and recall for detecting violative content.
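To make the abstract's single-prompt idea concrete, the sketch below illustrates (under assumptions, not the authors' implementation) how one prompt that returns a verdict plus a short rationale could back two of the proposed design patterns: pre-filtering non-violative content and flagging potential errors in human rating. The names `PROMPT_TEMPLATE`, `call_llm`, `rate_comment`, `prefilter`, and `flag_possible_rating_error` are hypothetical placeholders; the paper's actual optimized prompt and serving stack are not shown here.

```python
# Hypothetical sketch: one prompt reused across design patterns.
# All identifiers here are illustrative, not taken from the paper.

PROMPT_TEMPLATE = """You are a content policy rater. Policies: hate speech,
harassment, violent extremism, election misinformation.

Comment: {comment}

Answer with VIOLATIVE or NON-VIOLATIVE on the first line,
then one sentence of rationale."""


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; substitute your provider's client."""
    raise NotImplementedError


def rate_comment(comment: str) -> tuple[bool, str]:
    """Return (is_violative, rationale) from a single LLM call."""
    response = call_llm(PROMPT_TEMPLATE.format(comment=comment))
    verdict, _, rationale = response.partition("\n")
    return verdict.strip().upper().startswith("VIOLATIVE"), rationale.strip()


def prefilter(comments: list[str]) -> list[str]:
    """Pre-filtering pattern: route only likely-violative comments to human
    raters, preserving rater capacity for the content most in need of review."""
    return [c for c in comments if rate_comment(c)[0]]


def flag_possible_rating_error(comment: str, human_says_violative: bool) -> bool:
    """Error-detection pattern: surface comments where the LLM verdict
    disagrees with the human verdict as candidates for re-review."""
    return rate_comment(comment)[0] != human_says_violative
```

The same `rate_comment` output could also serve the context-surfacing pattern by displaying the rationale alongside the comment in the review queue; this is a design choice assumed here for illustration.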