Hate speech annotation is costly, subjective, and prone to annotator disagreement, which makes large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attributes, such as dehumanization, violence, and sentiment, evaluating both small and large variants of Llama 3.1 and Qwen 2.5. Our analysis reveals a consistent split across all models: behaviorally explicit dimensions (insult, humiliate, attack-defend) correlate strongly with human annotations, while evaluative dimensions (respect, sentiment, hate speech) are systematically inverted. Demographic persona conditioning reduces model confidence without improving alignment. Building on these insights, we propose combining attribute-level LLM predictions via a confidence-weighted Ridge regression to reconstruct the continuous hate speech scores of the Measuring Hate Speech corpus. This approach achieves an $R^2$ of up to 0.71 and outperforms direct prompting baselines, demonstrating that structured attribute decomposition recovers a richer and more human-aligned signal than end-to-end label prediction alone.
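To make the proposed combination step concrete, the following is a minimal sketch of a confidence-weighted Ridge regression over attribute-level predictions. It is not the authors' implementation: the abstract does not specify how confidence enters the model, so the sketch assumes each attribute prediction is multiplied by its confidence before fitting. The function names (`confidence_weighted_features`, `fit_ridge`, `r2_score`), the regularization strength `alpha`, and all data are hypothetical; the data are synthetic and unrelated to the Measuring Hate Speech corpus.

```python
import numpy as np


def confidence_weighted_features(preds, confs):
    """Scale each attribute prediction by its confidence (assumed weighting scheme)."""
    return np.asarray(preds, dtype=float) * np.asarray(confs, dtype=float)


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering keeps the intercept out of the penalty
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data: 200 comments, 10 attributes (hypothetical values).
rng = np.random.default_rng(0)
n_comments, n_attributes = 200, 10
preds = rng.uniform(0.0, 4.0, size=(n_comments, n_attributes))  # attribute ratings
confs = rng.uniform(0.5, 1.0, size=(n_comments, n_attributes))  # model confidences
X = confidence_weighted_features(preds, confs)

# Synthetic continuous target built from a random linear rule plus noise.
true_w = rng.normal(size=n_attributes)
y = X @ true_w + rng.normal(scale=0.1, size=n_comments)

# Fit on the first 150 comments, evaluate R^2 on the remaining 50.
coef, intercept = fit_ridge(X[:150], y[:150], alpha=1.0)
r2 = r2_score(y[150:], X[150:] @ coef + intercept)
print(f"held-out R^2 on synthetic data: {r2:.3f}")
```

Because the regression coefficients are free to take either sign, a linear combination of this kind can in principle still use an attribute whose predictions are inverted relative to human annotations; whether that accounts for the reported results is not stated in the abstract.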