Hate speech detection is commonly framed as a direct binary classification problem despite being a composite concept defined through multiple interacting factors that vary across legal frameworks, platform policies, and annotation guidelines. As a result, supervised models often overfit dataset-specific definitions and exhibit limited robustness under domain shift and annotation noise. We introduce xList-Hate, a diagnostic framework that decomposes hate speech detection into a checklist of explicit, concept-level questions grounded in widely shared normative criteria. Each question is independently answered by a large language model (LLM), producing a binary diagnostic representation that captures hateful content features without directly predicting the final label. These diagnostic signals are then aggregated by a lightweight, fully interpretable decision tree, yielding transparent and auditable predictions. We evaluate xList-Hate across multiple hate speech benchmarks and model families, comparing it against zero-shot LLM classification and in-domain supervised fine-tuning. While supervised methods typically maximize in-domain performance, xList-Hate consistently improves cross-dataset robustness and relative performance under domain shift. In addition, qualitative analysis of disagreement cases provides evidence that the framework can be less sensitive to certain forms of annotation inconsistency and contextual ambiguity. Crucially, the approach enables fine-grained interpretability through explicit decision paths and factor-level analysis. Our results suggest that reframing hate speech detection as a diagnostic reasoning task, rather than a monolithic classification problem, provides a robust, explainable, and extensible alternative for content moderation.
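The two-stage design described above (concept-level checklist answers aggregated by an interpretable decision tree) can be sketched as follows. This is a minimal illustration under stated assumptions: the checklist questions, the keyword-based stand-in for the LLM answering step, and the toy training data are all hypothetical, not the paper's actual artifacts.

```python
# Minimal sketch of a checklist-then-tree pipeline (hypothetical questions,
# mock LLM, toy data; the paper's real prompts and datasets differ).
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical concept-level checklist (each answered yes=1 / no=0).
CHECKLIST = [
    "targets a protected group",
    "expresses contempt or dehumanization",
    "incites harm or exclusion",
]

def llm_answer_checklist(text: str) -> list[int]:
    """Stand-in for one independent LLM call per checklist question.
    A real system would prompt an LLM; keyword cues suffice for the sketch."""
    cues = [("group",), ("inferior", "vermin"), ("attack", "expel")]
    return [int(any(c in text.lower() for c in ks)) for ks in cues]

# Toy diagnostic vectors and gold labels: here a text is hateful iff it
# targets a protected group AND (dehumanizes OR incites), y = q0 AND (q1 OR q2).
X = [[1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
y = [1, 0, 0, 1, 0, 1]

# Lightweight, fully inspectable aggregator over the diagnostic signals.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=CHECKLIST))  # auditable decision rules

# Full pipeline on a new input: checklist answers -> tree decision.
vec = llm_answer_checklist("they attack that group")
print("hateful" if tree.predict([vec])[0] == 1 else "not hateful")
```

Because the aggregator is a shallow tree over named concept-level questions, every prediction can be traced to an explicit decision path, which is the auditability property the abstract emphasizes.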