Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans meaningfully disagree. To build more pluralistic systems, it is essential to move beyond consensus and instead understand where and why disagreements arise. We introduce PluriHarms, a benchmark designed to systematically study human harm judgments across two key dimensions -- the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated against human data. The benchmark comprises 150 prompts with 15,000 ratings from 100 human annotators, enriched with annotator demographic and psychological traits and prompt-level features of harmful actions, effects, and values. Our analyses show that prompts involving imminent risks and tangible harms amplify perceived harmfulness, while annotator traits (e.g., toxicity experience, education) and their interactions with prompt content explain systematic disagreement. We benchmark AI safety models and alignment methods on PluriHarms, finding that while personalization significantly improves prediction of human harm judgments, considerable room remains for future progress. By explicitly targeting value diversity and disagreement, our work provides a principled benchmark for moving beyond "one-size-fits-all" safety toward pluralistically safe AI.