Online hate on social media ranges from overt slurs and threats (\emph{hard hate speech}) to \emph{soft hate speech}: discourse that appears reasonable on the surface but uses framing and value-based arguments to steer audiences toward blaming or excluding a target group. We hypothesize that current moderation systems, which are largely optimized for surface toxicity cues, are not robust to this reasoning-driven hostility, yet existing benchmarks do not measure this gap systematically. We introduce \textbf{\textsc{SoftHateBench}}, a generative benchmark that rewrites explicitly hateful content into soft-hate variants while preserving the underlying hostile standpoint. To generate soft hate, we integrate the \emph{Argumentum Model of Topics} (AMT) and \emph{Relevance Theory} (RT) in a unified framework: AMT provides the backbone argument structure for recasting an explicit hateful standpoint as a seemingly neutral discussion while preserving the stance, and RT guides generation so that the AMT argument chain remains logically coherent. The benchmark spans \textbf{7} sociocultural domains and \textbf{28} target groups, comprising \textbf{4,745} soft-hate instances. Evaluations across encoder-based detectors, general-purpose LLMs, and safety models show a consistent performance drop from the hard to the soft tier: systems that detect explicit hostility often fail when the same stance is conveyed through subtle, reasoning-based language. \textcolor{red}{\textbf{Disclaimer.} This paper contains offensive examples used solely for research purposes.}