X-Boundary：建立精确安全边界以在不损害可用性的前提下保护大语言模型免受多轮越狱攻击 (X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability) - 专知论文

会员服务 ·

0

表示 · 越狱 · 攻击 · 越狱攻击 · 语言模型 ·

2025 年 12 月 26 日

X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability

翻译：X-Boundary：建立精确安全边界以在不损害可用性的前提下保护大语言模型免受多轮越狱攻击

Xiaoya Lu,Dongrui Liu,Yi Yu,Luxin Xu,Jing Shao

Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks is still a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of mechanism interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training. Please see our code at: https://github.com/AI45Lab/X-Boundary.

翻译：尽管大语言模型的安全对齐技术发展迅速，但防御多轮越狱攻击仍是一项具有挑战性的任务。本文通过全面比较发现，现有的一些防御方法虽能提升大语言模型抵抗多轮越狱攻击的鲁棒性，却会损害其可用性，即降低通用能力或引发过度拒绝问题。从大语言模型机制可解释性的角度分析，我们发现这些方法未能建立能够精确区分安全与有害特征表示的边界。因此，靠近有害表示的边界安全表示不可避免地受到干扰，导致可用性下降。为解决这一问题，我们提出X-Boundary方法，通过将有害表示推离边界安全表示，从而获得精确的区分边界。这种方式可以精准消除有害表示而不干扰安全表示。实验结果表明，X-Boundary在多轮越狱防御中实现了最先进的性能，同时将过度拒绝率降低约20%，并保持近乎完整的通用能力。此外，我们从理论上证明并通过实验验证，X-Boundary能够加速训练过程中的收敛进程。代码详见：https://github.com/AI45Lab/X-Boundary。

0

相关内容

【ICML2025】隐私防护图像压缩：防御视觉-语言预训练模型的滥用

【ICML2025】隐私防护图像压缩：防御视觉-语言预训练模型的滥用

专知会员服务

5+阅读 · 2025年6月22日

大语言模型越狱攻击：模型、根因及其攻防演化

大语言模型越狱攻击：模型、根因及其攻防演化

专知会员服务

21+阅读 · 2025年4月28日

GRAPH-BERT ：学习图表示只需要注意力，GRAPH-BERT : Only Attention is Needed for Learning Graph Representations

GRAPH-BERT ：学习图表示只需要注意力，GRAPH-BERT : Only Attention is Needed for Learning Graph Representations

专知会员服务

78+阅读 · 2020年5月31日

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

专知会员服务

195+阅读 · 2020年5月31日

【CVPR2020-香港中文大学】PointGroup:用于3D实例分割的双设置点分组，PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation

【CVPR2020-香港中文大学】PointGroup:用于3D实例分割的双设置点分组，PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation

专知会员服务

12+阅读 · 2020年4月6日

【ACL2020-Facebook AI】跨语言表示学习，Unsupervised Cross-lingual Representation Learning at Scale

【ACL2020-Facebook AI】跨语言表示学习，Unsupervised Cross-lingual Representation Learning at Scale

专知会员服务

27+阅读 · 2020年4月5日

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

专知会员服务

21+阅读 · 2020年3月28日

【AAAI2020】Context-Transformer:上下文转换器:解决对象混淆的小样本检测，Context-Transformer: Tackling Object Confusion for Few-Shot Detection

【AAAI2020】Context-Transformer:上下文转换器:解决对象混淆的小样本检测，Context-Transformer: Tackling Object Confusion for Few-Shot Detection

专知会员服务

51+阅读 · 2020年3月17日

【WWW2020-MAGNN】异质图嵌入的集合图神经网络 MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding

【WWW2020-MAGNN】异质图嵌入的集合图神经网络 MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding

专知会员服务

116+阅读 · 2020年2月10日

【Google无监督大规模视觉表示迁移】Large Scale Learning of General Visual Representations for Transfer

【Google无监督大规模视觉表示迁移】Large Scale Learning of General Visual Representations for Transfer

专知会员服务

12+阅读 · 2020年1月7日

TKDE 2020 | 面向严格冷启动推荐的属性图神经网络

TKDE 2020 | 面向严格冷启动推荐的属性图神经网络

PaperWeekly

13+阅读 · 2020年12月18日

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图与推荐

10+阅读 · 2020年3月28日

KGCN：使用TensorFlow进行知识图谱的机器学习

KGCN：使用TensorFlow进行知识图谱的机器学习

专知

16+阅读 · 2019年8月4日

误差反向传播——CNN

误差反向传播——CNN

统计学习与视觉计算组

31+阅读 · 2018年7月12日

【推荐系统论文笔记】DKN: 基于深度知识感知的新闻推荐网络（WWW2018 ）

【推荐系统论文笔记】DKN: 基于深度知识感知的新闻推荐网络（WWW2018 ）

专知

18+阅读 · 2018年4月2日

机器翻译新时代：Facebook 开源无监督机器翻译模型和大规模训练语料

机器翻译新时代：Facebook 开源无监督机器翻译模型和大规模训练语料

机器学习研究会

12+阅读 · 2017年12月24日

斯坦福Jure Leskovec图表示学习：无监督和有监督方法（附PPT下载）

斯坦福Jure Leskovec图表示学习：无监督和有监督方法（附PPT下载）

专知

24+阅读 · 2017年12月17日

SSD: Single Shot MultiBox Detector 深度学习笔记之SSD物体检测模型

SSD: Single Shot MultiBox Detector 深度学习笔记之SSD物体检测模型

AI研习社

18+阅读 · 2017年8月31日

神经网络机器翻译原理：LSTM、seq2seq到Zero-Shot

神经网络机器翻译原理：LSTM、seq2seq到Zero-Shot

北京思腾合力科技有限公司

11+阅读 · 2017年8月10日

语义分割中的深度学习方法全解：从FCN、SegNet到DeepLab

语义分割中的深度学习方法全解：从FCN、SegNet到DeepLab

炼数成金订阅号

26+阅读 · 2017年7月10日

面向快速油藏历史拟合的粒子群算法研究

国家自然科学基金

4+阅读 · 2015年12月31日

基于深层特征学习的RGB-D人体行为识别方法

国家自然科学基金

4+阅读 · 2015年12月31日

T-S模糊神经网络的容错同步性分析

国家自然科学基金

0+阅读 · 2015年12月31日

非确定型Web服务流程重组的可靠性验证技术

国家自然科学基金

1+阅读 · 2015年12月31日

数据中心资源利用率敏感的编译方法

国家自然科学基金

0+阅读 · 2015年12月31日

复杂非完整多自主体网络协同算法设计与性能极限分析

国家自然科学基金

1+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

云存储中无证书可证明数据持有方案关键技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

Forward-Looking与Backward-Looking相结合的投资组合管理

国家自然科学基金

1+阅读 · 2014年12月31日

基于免疫的Rootkit隐遁攻击动态内存取证方法研究

国家自然科学基金

0+阅读 · 2014年12月31日

ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack

Arxiv

0+阅读 · 1月28日

One Word is Enough: Minimal Adversarial Perturbations for Neural Text Ranking

Arxiv

0+阅读 · 1月28日

Whitespaces Don't Lie: Feature-Driven and Embedding-Based Approaches for Detecting Machine-Generated Code

Arxiv

0+阅读 · 1月27日

GCFX: Generative Counterfactual Explanations for Deep Graph Models at the Model Level

Arxiv

0+阅读 · 1月26日

ToxSearch: Evolving Prompts for Toxicity Search in Large Language Models

Arxiv

0+阅读 · 1月24日

Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents

Arxiv

0+阅读 · 1月24日

Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models

Arxiv

0+阅读 · 1月21日

ChatAD: Reasoning-Enhanced Time-Series Anomaly Detection with Multi-Turn Instruction Evolution

Arxiv

0+阅读 · 1月20日

CausAdv: A Causal-based Framework for Detecting Adversarial Examples

Arxiv

0+阅读 · 1月17日

UEChecker: Detecting Unchecked External Call Vulnerabilities in DApps via Graph Analysis

Arxiv

0+阅读 · 1月15日

VIP会员

文章信息

相关主题

相关VIP内容

【ICML2025】隐私防护图像压缩：防御视觉-语言预训练模型的滥用

【ICML2025】隐私防护图像压缩：防御视觉-语言预训练模型的滥用

专知会员服务

5+阅读 · 2025年6月22日

大语言模型越狱攻击：模型、根因及其攻防演化

大语言模型越狱攻击：模型、根因及其攻防演化

专知会员服务

21+阅读 · 2025年4月28日

GRAPH-BERT ：学习图表示只需要注意力，GRAPH-BERT : Only Attention is Needed for Learning Graph Representations

GRAPH-BERT ：学习图表示只需要注意力，GRAPH-BERT : Only Attention is Needed for Learning Graph Representations

专知会员服务

78+阅读 · 2020年5月31日

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

专知会员服务

195+阅读 · 2020年5月31日

【CVPR2020-香港中文大学】PointGroup:用于3D实例分割的双设置点分组，PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation

【CVPR2020-香港中文大学】PointGroup:用于3D实例分割的双设置点分组，PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation

专知会员服务

12+阅读 · 2020年4月6日

【ACL2020-Facebook AI】跨语言表示学习，Unsupervised Cross-lingual Representation Learning at Scale

【ACL2020-Facebook AI】跨语言表示学习，Unsupervised Cross-lingual Representation Learning at Scale

专知会员服务

27+阅读 · 2020年4月5日

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

专知会员服务

21+阅读 · 2020年3月28日

【AAAI2020】Context-Transformer:上下文转换器:解决对象混淆的小样本检测，Context-Transformer: Tackling Object Confusion for Few-Shot Detection

【AAAI2020】Context-Transformer:上下文转换器:解决对象混淆的小样本检测，Context-Transformer: Tackling Object Confusion for Few-Shot Detection

专知会员服务

51+阅读 · 2020年3月17日

【WWW2020-MAGNN】异质图嵌入的集合图神经网络 MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding

【WWW2020-MAGNN】异质图嵌入的集合图神经网络 MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding

专知会员服务

116+阅读 · 2020年2月10日

【Google无监督大规模视觉表示迁移】Large Scale Learning of General Visual Representations for Transfer

【Google无监督大规模视觉表示迁移】Large Scale Learning of General Visual Representations for Transfer

专知会员服务

12+阅读 · 2020年1月7日

热门VIP内容

开通专知VIP会员享更多权益服务

论学习、公平性与复杂度

《整合杀伤链：一个用于边缘目标验证与战术推理的零样本框架》最新资料

2025中国人工智能学会系列白皮书⸺棋盘上的人工智能|附下载

通用智能体评估的逻辑架构

相关资讯

TKDE 2020 | 面向严格冷启动推荐的属性图神经网络

TKDE 2020 | 面向严格冷启动推荐的属性图神经网络

PaperWeekly

13+阅读 · 2020年12月18日

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图与推荐

10+阅读 · 2020年3月28日

KGCN：使用TensorFlow进行知识图谱的机器学习

KGCN：使用TensorFlow进行知识图谱的机器学习

专知

16+阅读 · 2019年8月4日

误差反向传播——CNN

误差反向传播——CNN

统计学习与视觉计算组

31+阅读 · 2018年7月12日

【推荐系统论文笔记】DKN: 基于深度知识感知的新闻推荐网络（WWW2018 ）

【推荐系统论文笔记】DKN: 基于深度知识感知的新闻推荐网络（WWW2018 ）

专知

18+阅读 · 2018年4月2日

机器翻译新时代：Facebook 开源无监督机器翻译模型和大规模训练语料

机器翻译新时代：Facebook 开源无监督机器翻译模型和大规模训练语料

机器学习研究会

12+阅读 · 2017年12月24日

斯坦福Jure Leskovec图表示学习：无监督和有监督方法（附PPT下载）

斯坦福Jure Leskovec图表示学习：无监督和有监督方法（附PPT下载）

专知

24+阅读 · 2017年12月17日

SSD: Single Shot MultiBox Detector 深度学习笔记之SSD物体检测模型

SSD: Single Shot MultiBox Detector 深度学习笔记之SSD物体检测模型

AI研习社

18+阅读 · 2017年8月31日

神经网络机器翻译原理：LSTM、seq2seq到Zero-Shot

神经网络机器翻译原理：LSTM、seq2seq到Zero-Shot

北京思腾合力科技有限公司

11+阅读 · 2017年8月10日

语义分割中的深度学习方法全解：从FCN、SegNet到DeepLab

语义分割中的深度学习方法全解：从FCN、SegNet到DeepLab

炼数成金订阅号

26+阅读 · 2017年7月10日

相关论文

ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack

Arxiv

0+阅读 · 1月28日

One Word is Enough: Minimal Adversarial Perturbations for Neural Text Ranking

Arxiv

0+阅读 · 1月28日

Whitespaces Don't Lie: Feature-Driven and Embedding-Based Approaches for Detecting Machine-Generated Code

Arxiv

0+阅读 · 1月27日

GCFX: Generative Counterfactual Explanations for Deep Graph Models at the Model Level

Arxiv

0+阅读 · 1月26日

ToxSearch: Evolving Prompts for Toxicity Search in Large Language Models

Arxiv

0+阅读 · 1月24日

Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents

Arxiv

0+阅读 · 1月24日

Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models

Arxiv

0+阅读 · 1月21日

ChatAD: Reasoning-Enhanced Time-Series Anomaly Detection with Multi-Turn Instruction Evolution

Arxiv

0+阅读 · 1月20日

CausAdv: A Causal-based Framework for Detecting Adversarial Examples

Arxiv

0+阅读 · 1月17日

UEChecker: Detecting Unchecked External Call Vulnerabilities in DApps via Graph Analysis

Arxiv

0+阅读 · 1月15日

相关基金

面向快速油藏历史拟合的粒子群算法研究

国家自然科学基金

4+阅读 · 2015年12月31日

基于深层特征学习的RGB-D人体行为识别方法

国家自然科学基金

4+阅读 · 2015年12月31日

T-S模糊神经网络的容错同步性分析

国家自然科学基金

0+阅读 · 2015年12月31日

非确定型Web服务流程重组的可靠性验证技术

国家自然科学基金

1+阅读 · 2015年12月31日

数据中心资源利用率敏感的编译方法

国家自然科学基金

0+阅读 · 2015年12月31日

复杂非完整多自主体网络协同算法设计与性能极限分析

国家自然科学基金

1+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

云存储中无证书可证明数据持有方案关键技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

Forward-Looking与Backward-Looking相结合的投资组合管理

国家自然科学基金

1+阅读 · 2014年12月31日

基于免疫的Rootkit隐遁攻击动态内存取证方法研究

国家自然科学基金

0+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员