Optimization of offensive content moderation models for different types of hateful messages is typically achieved through continued pre-training or fine-tuning on new hate speech benchmarks. However, existing benchmarks mainly address explicit hate toward protected groups and often overlook implicit or indirect hate, such as demeaning comparisons, calls for exclusion or violence, and subtle discriminatory language that still causes harm. While explicit hate can often be captured through surface features, implicit hate requires deeper, full-model semantic processing. In this work, we question the need for repeated fine-tuning and analyze the role of HatePrototypes, class-level vector representations derived from language models optimized for hate speech detection and safety moderation. We find that these prototypes, built from as few as 50 examples per class, enable cross-task transfer between explicit and implicit hate, with interchangeable prototypes across benchmarks. Moreover, we show that parameter-free early exiting with prototypes is effective for both hate types. We release the code, prototype resources, and evaluation scripts to support future research on efficient and transferable hate speech detection.
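As an illustration of the two ideas above (class-level prototypes and parameter-free early exiting), the following is a minimal sketch rather than the released implementation. It assumes that a prototype is the L2-normalised mean of a class's embeddings, that classification uses cosine similarity to the nearest prototype, and that the model exits at the first layer where the gap between the two best similarities reaches a threshold. The function names, the `margin` parameter, and the synthetic embeddings standing in for per-layer hidden states are all hypothetical.

```python
import numpy as np


def build_prototypes(embeddings, labels):
    """Return (classes, prototypes): one L2-normalised mean vector per class.

    Assumption: a prototype is the class mean of the embeddings.
    """
    classes = np.unique(labels)
    protos = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return classes, protos


def prototype_scores(x, protos):
    """Cosine similarity between one embedding and every class prototype."""
    return protos @ (x / np.linalg.norm(x))


def early_exit_predict(layer_embeddings, layer_protos, classes, margin=0.2):
    """Return (label, exit_layer) for one input.

    Walks the layers in order and exits at the first layer whose top-2
    similarity gap is at least `margin` (a hypothetical exit rule). If no
    layer is confident enough, the last layer's prediction is used.
    No parameters are trained: only prototypes and a threshold are needed.
    """
    for depth, (x, protos) in enumerate(zip(layer_embeddings, layer_protos)):
        scores = prototype_scores(x, protos)
        order = np.argsort(scores)[::-1]
        if scores[order[0]] - scores[order[1]] >= margin:
            break
    return classes[order[0]], depth


def demo():
    # Synthetic stand-in for per-layer hidden states: class separation
    # grows with depth, and each class contributes 50 examples.
    rng = np.random.default_rng(0)
    dim, n_per_class = 16, 50
    labels = np.repeat([0, 1], n_per_class)
    separations = (0.5, 2.0, 5.0)

    layer_protos = []
    for sep in separations:
        centers = np.zeros((2, dim))
        centers[0, 0], centers[1, 0] = -sep, sep
        emb = centers[labels] + rng.normal(size=(labels.size, dim))
        classes, protos = build_prototypes(emb, labels)
        layer_protos.append(protos)

    # One query drawn near class 1 at every layer.
    query = [np.r_[sep, np.zeros(dim - 1)] + rng.normal(size=dim)
             for sep in separations]
    label, depth = early_exit_predict(query, layer_protos, classes)
    print(f"predicted class {label}, exited at layer {depth}")


if __name__ == "__main__":
    demo()
```

Under these assumptions, swapping prototypes across benchmarks amounts to passing a different `layer_protos` list built from another dataset's embeddings, with the encoder left unchanged.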