Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs


Steganography · Language models · Agents · Large language models

December 2, 2025


Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, Nandi Schoots

From arXiv. Camera-ready version. Oral presentation at IJCNLP-AACL 2025 (14th International Joint Conference on Natural Language Processing and 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics), Mumbai, India, December 20-24, 2025.

The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness of corresponding countermeasures. To investigate this problem, we design two approaches, a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method, for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due to misspecified reward incentives during training. Additionally, we find that standard mitigations, both passive oversight of model outputs and active mitigation through communication paraphrasing, are not fully effective at preventing this steganographic communication. Our findings imply that (i) emergence of steganographic collusion is a plausible concern that should be monitored and researched, and (ii) preventing emergence may require innovation in mitigation techniques.

