Assessing How Hate, Counterspeech, and Toxicity Affect Hate Group Newcomers

From arXiv. 20 pages, 14 figures. arXiv admin note: text overlap with arXiv:2303.13641. Currently in press, Proceedings of the Twentieth International AAAI Conference on Web and Social Media (2024).

Counterspeech has gained attention as a strategy to reduce hate speech on social media. Although previous studies suggest that counterspeech can reduce hate speech, little is known about its effects on participation in online hate communities. Relatedly, we lack an understanding of the degree of hostility in counterspeech. Hostile counterspeech may increase online conflict, potentially hardening the positions of hate adherents and further eroding online environments. Here, we analyzed the effect of counterspeech on 16,513 newcomers across 104 hate subreddits (forums within Reddit.com). We devised an LLM-based counterspeech detection approach that outperforms specialized models trained on existing datasets, then examined the presence and effects of hostility in counterspeech. While counterspeech comments are less toxic than hate speech comments, they are almost twice as toxic as other discourse within hate subreddits. We then evaluated the effect of counterspeech on newcomer engagement in hate subreddits. We found that newcomers using hate speech who receive counterspeech are less likely to continue posting within these hate subreddits, rather than becoming galvanized. We speculate that, instead of constituting ardent hate adherents, readily dissuaded newcomers may merely be toying with beliefs that are proscribed in other contexts. We found no association between the toxicity of counterspeech and its effects on user retention. However, consistent with prior research regarding the harmful effects of toxic speech, we found that toxic counterspeech increases the probability of continued hostility from hate users within the same discussion.
