How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent

Sungwoo Jung, Seonil Son

Agent harnesses -- the stateful programs that wrap a language model and decide what it sees at each step -- are now known to change end-to-end performance on a fixed model by as much as six times. That raises a question asked less often than it should be: how much of an agent's competence does the harness itself already carry, and how much genuinely still needs the LLM? We externalize a planning harness for noisy Collaborative Battleship into four progressively richer layers -- posterior belief tracking, declarative planning, symbolic reflection, and an LLM-backed revision gate -- under a common runtime, taking win rate as the primary metric and F1 as secondary, and pre-specifying "heavy lifting" as the single largest positive marginal to the primary metric. Across 54 games, declarative planning carries the heavy lifting (+24.1 pp win rate over a belief-only harness, zero LLM calls); symbolic reflection is mechanistically real but calibration-sensitive, with signed board-level effects up to ±0.140 F1 that cancel on aggregate; and LLM-backed revision activates on only 4.3% of turns with a bounded, non-monotonic effect. The contribution is methodological: once harness layers are made externally measurable, the LLM's role can be quantified as residual rather than assumed central.
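
The pre-specified criterion above (heavy lifting as the single largest positive marginal to the primary metric) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the layer labels, and the win rates in `ladder` are hypothetical placeholders chosen only to show the arithmetic, and they do not reproduce the paper's measurements.

```python
from typing import Optional

# Each entry is (layer name, primary metric with all layers up to and
# including this one enabled). The first entry is the baseline harness.
Ladder = list[tuple[str, float]]


def marginals(ladder: Ladder) -> list[tuple[str, float]]:
    """Change in the primary metric contributed by each layer added on top
    of the previous configuration (the baseline itself has no marginal)."""
    return [
        (name, rate - prev_rate)
        for (_, prev_rate), (name, rate) in zip(ladder, ladder[1:])
    ]


def heavy_lifting(ladder: Ladder) -> Optional[str]:
    """Name of the layer with the single largest positive marginal,
    or None if no added layer improves the primary metric."""
    positive = [(name, delta) for name, delta in marginals(ladder) if delta > 0]
    if not positive:
        return None
    return max(positive, key=lambda item: item[1])[0]


# Hypothetical win rates (fractions of games won), for illustration only.
ladder: Ladder = [
    ("belief_tracking", 0.50),
    ("declarative_planning", 0.75),
    ("symbolic_reflection", 0.70),
    ("llm_revision_gate", 0.72),
]

if __name__ == "__main__":
    for name, delta in marginals(ladder):
        print(f"{name}: {delta * 100:+.1f} pp")
    print("heavy lifting:", heavy_lifting(ladder))
```

Fixing the criterion before running the games means the layer credited with the heavy lifting is determined by the measured marginals rather than chosen after the fact. A layer with a negative or zero marginal, such as the placeholder reflection layer here, is never selected.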
