Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans meaningfully disagree. To build more pluralistic systems, it is essential to move beyond consensus and instead understand where and why disagreements arise. We introduce PluriHarms, a benchmark designed to systematically study human harm judgments across two key dimensions -- the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated against human data. The benchmark comprises 150 prompts with 15,000 ratings from 100 human annotators, enriched with annotator demographic and psychological traits and prompt-level features of harmful actions, effects, and values. Our analyses show that prompts involving imminent risks and tangible harms amplify perceived harmfulness, while annotator traits (e.g., toxicity experience, education) and their interactions with prompt content explain systematic disagreement. We benchmark AI safety models and alignment methods on PluriHarms, finding that while personalization significantly improves prediction of human harm judgments, considerable room remains for future progress. By explicitly targeting value diversity and disagreement, our work provides a principled benchmark for moving beyond "one-size-fits-all" safety toward pluralistically safe AI.