Measuring Safety Alignment Effects in Autonomous Security Agents - 专知论文

会员服务 ·

0

Measuring Safety Alignment Effects in Autonomous Security Agents

翻译：暂无翻译

Isaac David,Arthur Gervais

Do stock safety-aligned language models and their uncensored or abliterated derivatives behave differently when run as autonomous security agents? Single-turn refusal benchmarks cannot answer this question: security agents must inspect repositories, call tools, and produce vulnerability evidence inside authorized sandboxes. We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, and compare four stock models against uncensored or abliterated derivatives: Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B. The artifact contains 1,500 security-agent traces and 800 non-security control traces. The Gemma pairs show large less-restricted gains on security tasks: 14.0% versus 0.7% success for 31B and 10.7% versus 0.0% for 26B, with higher mean grounding (3.91 versus 3.27 and 4.12 versus 1.64 out of five) and 0.0% refusal, suppressed-action, and unsafe-action rates in the 31B traces. However, controls and non-Gemma pairs rule out a clean security-specific or universal less-restricted effect: Gemma gaps also appear on ordinary coding tasks, Qwen2.5-Coder success is lower for the less-restricted derivative (2.0% versus 5.3%), and the abliterated Llama derivative fails the tool protocol. Across all families, hard proof-of-trigger and patch-verification tasks remain unsolved. These results show that safety alignment effects in autonomous security agents should be measured at the system level, separating refusal, unsafe action, tool reliability, and evidence grounding rather than treating refusal rate as the safety signal.

翻译：暂无翻译

0

相关内容

具身AI安全综述：风险、攻击与防御

具身AI安全综述：风险、攻击与防御

专知会员服务

11+阅读 · 5月6日

《提升生成模型的安全性与保障》博士论文

《提升生成模型的安全性与保障》博士论文

专知会员服务

12+阅读 · 4月20日

【博士论文】重新审视机器人安全性：面向真实世界自主运行的自适应与可扩展方法

【博士论文】重新审视机器人安全性：面向真实世界自主运行的自适应与可扩展方法

专知会员服务

12+阅读 · 2月25日

认知优势：人工智能在国家安全决策中的核心作用

认知优势：人工智能在国家安全决策中的核心作用

专知会员服务

15+阅读 · 2025年8月16日

【CMU博士论文】重新思考面向风险感知的社会型具身智能的安全保障体系

【CMU博士论文】重新思考面向风险感知的社会型具身智能的安全保障体系

专知会员服务

15+阅读 · 2025年5月9日

针对自动驾驶智能模型的攻击与防御

针对自动驾驶智能模型的攻击与防御

专知会员服务

19+阅读 · 2024年6月25日

【CVPR 2022】深度安全多视图聚类:降低因视图增加而导致聚类性能下降的风险，Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase

【CVPR 2022】深度安全多视图聚类:降低因视图增加而导致聚类性能下降的风险，Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase

专知会员服务

10+阅读 · 2022年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

【ICML2019 tutorial】安全机器学习（Safe Machine Learning），Silvia Chiappa，Jan Leike

【ICML2019 tutorial】安全机器学习（Safe Machine Learning），Silvia Chiappa，Jan Leike

专知会员服务

23+阅读 · 2019年6月10日

无人预警机系统架构及关键技术分析

无人预警机系统架构及关键技术分析

专知

13+阅读 · 2022年8月6日

Effective.Modern.C++ 中英文版，334页pdf

Effective.Modern.C++ 中英文版，334页pdf

专知

26+阅读 · 2020年11月4日

自动驾驶毫米波雷达物体检测技术-算法

自动驾驶毫米波雷达物体检测技术-算法

CVer

14+阅读 · 2020年5月10日

初学者系列：Attentional Factorization Machines（AFM）详解

初学者系列：Attentional Factorization Machines（AFM）详解

专知

82+阅读 · 2019年9月16日

【泡泡图灵智库】Detect-SLAM：目标检测和SLAM相互收益

【泡泡图灵智库】Detect-SLAM：目标检测和SLAM相互收益

泡泡机器人SLAM

14+阅读 · 2019年6月28日

自动驾驶车辆定位技术概述｜厚势汽车

自动驾驶车辆定位技术概述｜厚势汽车

厚势

10+阅读 · 2019年5月16日

Github项目推荐 | 股市预测的机器学习/深度学习模型/资源集锦

Github项目推荐 | 股市预测的机器学习/深度学习模型/资源集锦

AI研习社

33+阅读 · 2019年4月18日

深度学习中Attention Mechanism详细介绍：原理、分类及应用

深度学习中Attention Mechanism详细介绍：原理、分类及应用

深度学习与NLP

10+阅读 · 2019年2月18日

自动驾驶功能安全评估：基于仿真的故障注入 | 厚势汽车

自动驾驶功能安全评估：基于仿真的故障注入 | 厚势汽车

厚势

14+阅读 · 2018年9月11日

Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection

统计学习与视觉计算组

12+阅读 · 2018年3月15日

面向主动安全控制的工程车辆动态信息获取与状态辨识

国家自然科学基金

0+阅读 · 2015年12月31日

环境敏感的车载自组网自适应通信机制研究

国家自然科学基金

3+阅读 · 2015年12月31日

混合交通环境中自动驾驶汽车安全可达性分析与优化控制研究

国家自然科学基金

1+阅读 · 2015年12月31日

自我损耗对工作场所安全绩效的影响及缓解途径

国家自然科学基金

0+阅读 · 2015年12月31日

复杂需求场景驱动的软件安全防护模型检测技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

基于自适应模型检测的安全协议自动建模与设计研究

国家自然科学基金

1+阅读 · 2014年12月31日

隐写模糊安全性测度及其优化嵌入算法研究

国家自然科学基金

0+阅读 · 2014年12月31日

基于可靠度的道路线形参数选用及安全性评价研究

国家自然科学基金

0+阅读 · 2014年12月31日

老年驾驶人风险感知及临界反应能力研究

国家自然科学基金

0+阅读 · 2014年12月31日

地铁施工安全风险时空耦合机理及实景仿真预警技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

From Attacks to Curricula: Learnability-Guided Adversarial Training for Safe Autonomous Driving

Arxiv

0+阅读 · 6月12日

Self-Mined Hardness for Safety Fine-Tuning

Arxiv

0+阅读 · 6月8日

Mission-Level Runtime Assurance Framework for Autonomous Driving

Arxiv

0+阅读 · 6月5日

Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

Arxiv

0+阅读 · 6月4日

Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4

Arxiv

0+阅读 · 6月3日

When Autoregressive Consistency Hurts Safety Alignment

Arxiv

0+阅读 · 6月2日

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Arxiv

0+阅读 · 5月22日

Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution

Arxiv

0+阅读 · 5月13日

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Arxiv

0+阅读 · 5月11日

AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification

Arxiv

0+阅读 · 5月11日

VIP会员

文章信息

相关主题

最新内容

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

专知会员服务

3+阅读 · 今天8:18

《面向导弹有效发射时机的监督机器学习方法：基于超视距空战仿真》

《面向导弹有效发射时机的监督机器学习方法：基于超视距空战仿真》

专知会员服务

3+阅读 · 今天7:39

《通用大语言模型：无人机指挥与控制接口》最新40页

《通用大语言模型：无人机指挥与控制接口》最新40页

专知会员服务

8+阅读 · 今天7:33

《通过小型无人机系统将情报能力“作战化”》

《通过小型无人机系统将情报能力“作战化”》

专知会员服务

3+阅读 · 今天7:28

《神经安全型有人–无人协同：面向认知自适应作战能力的参考架构》

《神经安全型有人–无人协同：面向认知自适应作战能力的参考架构》

专知会员服务

5+阅读 · 今天7:14

《在指挥链中通过多准则决策分析传达指挥官意图：空战实验》

《在指挥链中通过多准则决策分析传达指挥官意图：空战实验》

专知会员服务

18+阅读 · 6月15日

消耗优势：美军的“精确规模化”概念

消耗优势：美军的“精确规模化”概念

专知会员服务

7+阅读 · 6月15日

五角大楼的AI优先战略及其对现代战争的启示：来自与伊朗冲突的经验教训

五角大楼的AI优先战略及其对现代战争的启示：来自与伊朗冲突的经验教训

专知会员服务

8+阅读 · 6月15日

《网络空间兵棋推演：挑战、局限性与混合路径》报告

《网络空间兵棋推演：挑战、局限性与混合路径》报告

专知会员服务

8+阅读 · 6月15日

《离线语言支持系统：面向空战战术决策》

《离线语言支持系统：面向空战战术决策》

专知会员服务

8+阅读 · 6月15日

《以通信为中心的6G–LLM架构：面向可扩展的战术自主防御车辆网络》

《以通信为中心的6G–LLM架构：面向可扩展的战术自主防御车辆网络》

专知会员服务

6+阅读 · 6月15日

ICML 2026｜ECA：面向开放式图文生成的高效持续对齐

ICML 2026｜ECA：面向开放式图文生成的高效持续对齐

专知会员服务

6+阅读 · 6月14日

可信智能体AI综述：安全、鲁棒性、隐私与系统安全

可信智能体AI综述：安全、鲁棒性、隐私与系统安全

专知会员服务

6+阅读 · 6月14日

俄乌战场地面机器人如何改写战争规则

俄乌战场地面机器人如何改写战争规则

专知会员服务

9+阅读 · 6月14日

美国海军研究生院第23届年度采购研究研讨会与创新峰会：主题“加速作战能力”，附会议报告论文集1300页

美国海军研究生院第23届年度采购研究研讨会与创新峰会：主题“加速作战能力”，附会议报告论文集1300页

专知会员服务

13+阅读 · 6月14日

相关VIP内容

具身AI安全综述：风险、攻击与防御

具身AI安全综述：风险、攻击与防御

专知会员服务

11+阅读 · 5月6日

《提升生成模型的安全性与保障》博士论文

《提升生成模型的安全性与保障》博士论文

专知会员服务

12+阅读 · 4月20日

【博士论文】重新审视机器人安全性：面向真实世界自主运行的自适应与可扩展方法

【博士论文】重新审视机器人安全性：面向真实世界自主运行的自适应与可扩展方法

专知会员服务

12+阅读 · 2月25日

认知优势：人工智能在国家安全决策中的核心作用

认知优势：人工智能在国家安全决策中的核心作用

专知会员服务

15+阅读 · 2025年8月16日

【CMU博士论文】重新思考面向风险感知的社会型具身智能的安全保障体系

【CMU博士论文】重新思考面向风险感知的社会型具身智能的安全保障体系

专知会员服务

15+阅读 · 2025年5月9日

针对自动驾驶智能模型的攻击与防御

针对自动驾驶智能模型的攻击与防御

专知会员服务

19+阅读 · 2024年6月25日

【CVPR 2022】深度安全多视图聚类:降低因视图增加而导致聚类性能下降的风险，Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase

【CVPR 2022】深度安全多视图聚类:降低因视图增加而导致聚类性能下降的风险，Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase

专知会员服务

10+阅读 · 2022年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

【ICML2019 tutorial】安全机器学习（Safe Machine Learning），Silvia Chiappa，Jan Leike

【ICML2019 tutorial】安全机器学习（Safe Machine Learning），Silvia Chiappa，Jan Leike

专知会员服务

23+阅读 · 2019年6月10日

热门VIP内容

开通专知VIP会员享更多权益服务

《面向导弹有效发射时机的监督机器学习方法：基于超视距空战仿真》

《通过小型无人机系统将情报能力“作战化”》

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

《通用大语言模型：无人机指挥与控制接口》最新40页

相关资讯

无人预警机系统架构及关键技术分析

无人预警机系统架构及关键技术分析

专知

13+阅读 · 2022年8月6日

Effective.Modern.C++ 中英文版，334页pdf

Effective.Modern.C++ 中英文版，334页pdf

专知

26+阅读 · 2020年11月4日

自动驾驶毫米波雷达物体检测技术-算法

自动驾驶毫米波雷达物体检测技术-算法

CVer

14+阅读 · 2020年5月10日

初学者系列：Attentional Factorization Machines（AFM）详解

初学者系列：Attentional Factorization Machines（AFM）详解

专知

82+阅读 · 2019年9月16日

【泡泡图灵智库】Detect-SLAM：目标检测和SLAM相互收益

【泡泡图灵智库】Detect-SLAM：目标检测和SLAM相互收益

泡泡机器人SLAM

14+阅读 · 2019年6月28日

自动驾驶车辆定位技术概述｜厚势汽车

自动驾驶车辆定位技术概述｜厚势汽车

厚势

10+阅读 · 2019年5月16日

Github项目推荐 | 股市预测的机器学习/深度学习模型/资源集锦

Github项目推荐 | 股市预测的机器学习/深度学习模型/资源集锦

AI研习社

33+阅读 · 2019年4月18日

深度学习中Attention Mechanism详细介绍：原理、分类及应用

深度学习中Attention Mechanism详细介绍：原理、分类及应用

深度学习与NLP

10+阅读 · 2019年2月18日

自动驾驶功能安全评估：基于仿真的故障注入 | 厚势汽车

自动驾驶功能安全评估：基于仿真的故障注入 | 厚势汽车

厚势

14+阅读 · 2018年9月11日

Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection

统计学习与视觉计算组

12+阅读 · 2018年3月15日

相关论文

From Attacks to Curricula: Learnability-Guided Adversarial Training for Safe Autonomous Driving

Arxiv

0+阅读 · 6月12日

Self-Mined Hardness for Safety Fine-Tuning

Arxiv

0+阅读 · 6月8日

Mission-Level Runtime Assurance Framework for Autonomous Driving

Arxiv

0+阅读 · 6月5日

Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

Arxiv

0+阅读 · 6月4日

Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4

Arxiv

0+阅读 · 6月3日

When Autoregressive Consistency Hurts Safety Alignment

Arxiv

0+阅读 · 6月2日

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Arxiv

0+阅读 · 5月22日

Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution

Arxiv

0+阅读 · 5月13日

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Arxiv

0+阅读 · 5月11日

AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification

Arxiv

0+阅读 · 5月11日

相关基金

面向主动安全控制的工程车辆动态信息获取与状态辨识

国家自然科学基金

0+阅读 · 2015年12月31日

环境敏感的车载自组网自适应通信机制研究

国家自然科学基金

3+阅读 · 2015年12月31日

混合交通环境中自动驾驶汽车安全可达性分析与优化控制研究

国家自然科学基金

1+阅读 · 2015年12月31日

自我损耗对工作场所安全绩效的影响及缓解途径

国家自然科学基金

0+阅读 · 2015年12月31日

复杂需求场景驱动的软件安全防护模型检测技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

基于自适应模型检测的安全协议自动建模与设计研究

国家自然科学基金

1+阅读 · 2014年12月31日

隐写模糊安全性测度及其优化嵌入算法研究

国家自然科学基金

0+阅读 · 2014年12月31日

基于可靠度的道路线形参数选用及安全性评价研究

国家自然科学基金

0+阅读 · 2014年12月31日

老年驾驶人风险感知及临界反应能力研究

国家自然科学基金

0+阅读 · 2014年12月31日

地铁施工安全风险时空耦合机理及实景仿真预警技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员