Asymmetric Goal Drift in Coding Agents Under Value Conflict


Magnus Saebo, Spencer Gibson, Tyler Crosse, Achyutha Menon, Eyon Jang, Diogo Cruz

From arXiv. 5 pages, 4 figures. Published as a workshop paper at Lifelong Agents @ ICLR 2026.

Coding agents are increasingly deployed autonomously, at scale, and over long-context horizons. To be effective and safe, these agents must navigate complex trade-offs in deployment, balancing influence from the user, their learned values, and the codebase itself. Understanding how agents resolve these trade-offs in practice is critical, yet prior work has relied on static, synthetic settings that do not capture the complexity of real-world environments. To this end, we introduce a framework built on OpenCode in which a coding agent completes realistic, multi-step tasks under a system prompt constraint favoring one side of a value trade-off. We measure how often the agent violates this constraint as it completes tasks, with and without environmental pressure toward the competing value. Using this framework, we demonstrate that GPT-5 mini, Haiku 4.5, and Grok Code Fast 1 exhibit asymmetric drift: they are more likely to violate their system prompt when its constraint opposes strongly held values like security and privacy. For the models and values tested, we find that goal drift correlates with three compounding factors: value alignment, adversarial pressure, and accumulated context. However, even constraints aligned with strongly held values like privacy are violated under sustained environmental pressure for some models. Our findings reveal that shallow compliance checks are insufficient, and that environmental signals can override explicit constraints in ways that appear exploitable. Malicious actors with access to the codebase could manipulate agent behavior by appealing to learned values, with the risk compounding over the long horizons typical of agentic deployment.
