Adapting offensive content moderation models to new types of hateful messages typically requires continued pre-training or fine-tuning on new hate speech benchmarks. However, existing benchmarks mainly address explicit hate toward protected groups and often overlook implicit or indirect hate, such as demeaning comparisons, calls for exclusion or violence, and subtle discriminatory language that still causes harm. While explicit hate can often be captured through surface features, implicit hate requires deeper semantic processing across the full model. In this work, we question the need for repeated fine-tuning and analyze the role of HatePrototypes, class-level vector representations derived from language models optimized for hate speech detection and safety moderation. We find that these prototypes, built from as few as 50 examples per class, enable cross-task transfer between explicit and implicit hate, and that prototypes remain interchangeable across benchmarks. Moreover, we show that parameter-free early exiting with prototypes is effective for both hate types. We release the code, prototype resources, and evaluation scripts to support future research on efficient and transferable hate speech detection.
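The abstract does not spell out how the class-level prototypes are built, but a natural reading is to average encoder representations of a handful of labeled examples per class and classify new inputs by similarity to the nearest prototype. The following is a minimal sketch under that assumption; the model name, mean pooling, and cosine similarity are illustrative choices, not the paper's confirmed configuration.

```python
# Hedged sketch: build class-level prototypes from ~50 examples per class
# and classify by nearest prototype. Model name and pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "cardiffnlp/twitter-roberta-base-hate"  # hypothetical choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def embed(texts):
    """Mean-pooled last-layer hidden states, one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state       # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)     # (B, H)

def build_prototypes(texts_by_class):
    """Average the embeddings of the examples of each class into one vector."""
    return {label: embed(texts).mean(0) for label, texts in texts_by_class.items()}

def classify(text, prototypes):
    """Assign the class whose prototype is most cosine-similar to the input."""
    vec = embed([text])[0]
    scores = {label: torch.cosine_similarity(vec, proto, dim=0).item()
              for label, proto in prototypes.items()}
    return max(scores, key=scores.get)
```

Because the prototypes are plain vectors rather than trained parameters, swapping the prototype set is enough to move between explicit- and implicit-hate benchmarks without touching the model weights, which is what makes the cross-benchmark interchangeability claim cheap to exercise.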
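For the parameter-free early exiting, one plausible reading is to keep a prototype set per transformer layer and stop as soon as the intermediate representation clearly favors one class. The sketch below uses a cosine-margin criterion between the top two classes; the margin rule, threshold, and per-layer prototype structure are assumptions for illustration. Note that for clarity it runs the full forward pass and then scans the exposed hidden states; a real implementation would run the encoder layer by layer to actually save compute.

```python
# Hedged sketch of parameter-free early exiting with per-layer prototypes.
# Exit criterion (cosine margin between top-2 classes) is an assumption.
import torch

@torch.no_grad()
def early_exit_classify(model, tokenizer, text, layer_prototypes, margin=0.1):
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model(**batch, output_hidden_states=True)
    mask = batch["attention_mask"].unsqueeze(-1)
    # outputs.hidden_states[1:] are the outputs of each transformer layer
    for layer_idx, hidden in enumerate(outputs.hidden_states[1:], start=1):
        vec = (hidden * mask).sum(1)[0] / mask.sum(1)[0]  # mean pooling
        sims = sorted(
            ((torch.cosine_similarity(vec, proto, dim=0).item(), label)
             for label, proto in layer_prototypes[layer_idx].items()),
            reverse=True)
        # exit once the best class clearly separates from the runner-up
        if sims[0][0] - sims[1][0] >= margin:
            return sims[0][1], layer_idx
    return sims[0][1], layer_idx  # fell through to the last layer
```

The approach adds no exit classifiers or extra parameters: the same frozen prototype vectors double as both the final classifier and the per-layer exit criterion.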