Prior work has shown that safety interventions applied during pretraining, such as removing and rephrasing harmful content, can substantially improve the robustness of the resulting models. In this paper, we study a fundamental question that prior work has overlooked: "When during pretraining should safety interventions be introduced?" We keep the underlying data sources and pretraining interventions fixed, varying only the intervention start time (after 0%, 20%, or 60% of pretraining tokens). We find that the optimal start time is not one-size-fits-all: with standard top-k decoding, introducing interventions after a short initial phase of safe-only pretraining (20%–60%) often yields the strongest robustness, with the clearest benefits emerging after downstream, benign finetuning. In contrast, for safety-aware inference, interventions starting from the beginning improve steerability toward safer generations. Finally, we observe that earlier interventions reshape internal representations: linear probes more cleanly separate safe vs. harmful examples. Our results are the first to establish intervention timing as a key curriculum design choice for safety.
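The linear-probe observation above can be illustrated with a minimal sketch: train a logistic-regression probe on vectors standing in for hidden states of safe and harmful examples, and check how cleanly a linear decision boundary separates them. The synthetic Gaussian "representations", dimensions, and training hyperparameters below are all illustrative assumptions, not the paper's actual setup, which probes real model activations.

```python
import math
import random

random.seed(0)

# Hypothetical stand-ins for hidden-state vectors: safe and harmful
# examples drawn from two shifted Gaussians. Real probes would use
# activations extracted from the pretrained model.
d, n = 8, 200
safe = [[random.gauss(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
harmful = [[random.gauss(1.0, 1.0) for _ in range(d)] for _ in range(n)]
X = safe + harmful
y = [0.0] * n + [1.0] * n

# Linear probe: logistic regression trained by plain gradient descent.
w, b, lr = [0.0] * d, 0.0, 0.1
for _ in range(300):
    gw, gb = [0.0] * d, 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        z = max(min(z, 30.0), -30.0)  # clip to avoid exp overflow
        err = 1.0 / (1.0 + math.exp(-z)) - yi
        for j in range(d):
            gw[j] += err * xi[j]
        gb += err
    for j in range(d):
        w[j] -= lr * gw[j] / len(y)
    b -= lr * gb / len(y)

# Probe accuracy = how linearly separable the two classes are.
acc = sum(
    (sum(wj * xj for wj, xj in zip(w, xi)) + b > 0) == (yi == 1.0)
    for xi, yi in zip(X, y)
) / len(y)
print(f"probe accuracy: {acc:.2f}")
```

Under the timing hypothesis, a probe trained on a model with earlier interventions would reach higher accuracy than one trained on a late-intervention model, reflecting cleaner internal separation of the two classes.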