As large language models (LLMs) become increasingly prevalent in a wide variety of applications, concerns about the safety of their outputs have grown more significant. Most safety-tuning and moderation efforts today take a predominantly Western-centric view of safety, especially for toxic, hateful, or violent speech. In this paper, we describe LionGuard, a Singapore-contextualized moderation classifier that can serve as a guardrail against unsafe LLM outputs. When assessed on Singlish data, LionGuard outperforms existing widely-used moderation APIs, which are not fine-tuned for the Singapore context, by 14% on binary classification and by up to 51% on multi-label classification. Our work highlights the benefits of localization for moderation classifiers and presents a practical and scalable approach for low-resource languages.