Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attributes, such as dehumanization, violence, and sentiment, evaluating both small and large variants of Llama 3.1 and Qwen 2.5. Our analysis reveals a consistent split across all models: behaviorally explicit dimensions (insult, humiliate, attack-defend) correlate strongly with human annotations, while evaluative dimensions (respect, sentiment, hate speech) are systematically inverted. Demographic persona conditioning reduces model confidence without improving alignment. Building on these insights, we propose combining attribute-level LLM predictions via a confidence-weighted Ridge regression to reconstruct continuous hate speech scores from the Measuring Hate Speech corpus, achieving $R^2$ of up to 0.71 and outperforming direct prompting baselines, demonstrating that structured attribute decomposition recovers a richer and more human-aligned signal than end-to-end label prediction alone.
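The abstract does not say how the confidence weighting enters the regression, so the following is only a minimal sketch of one plausible reading: each attribute-level prediction is scaled by the model's confidence before a closed-form ridge fit. The function names, the `alpha` regularization strength, and the multiply-by-confidence scheme are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def fit_confidence_weighted_ridge(preds, conf, y, alpha=1.0):
    """Fit ridge regression on confidence-scaled attribute predictions.

    preds: (n_samples, n_attributes) LLM attribute scores
    conf:  (n_samples, n_attributes) model confidences (assumed in [0, 1])
    y:     (n_samples,) continuous human hate speech scores
    """
    # Assumption: confidence weighting = elementwise scaling of the features.
    X = preds * conf
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Closed-form ridge solution on centered data; the intercept is not penalized.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def predict(preds, conf, w, b):
    """Reconstruct the continuous score from attribute predictions."""
    return (preds * conf) @ w + b


def r2_score(y_true, y_pred):
    """Coefficient of determination, the metric reported in the abstract."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

With ten attributes, `preds` and `conf` would each have ten columns. Because the fitted weights can be negative, a regression of this form can still use an attribute whose raw predictions are inverted relative to human labels, which is one way to read the claim that decomposition recovers more signal than direct prompting.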

