SentinelBench: A Benchmark for Long-Running Monitoring Agents - 专知论文

会员服务 ·

0

Agent · WEB · MoDELS · 回合 · AI ·

SentinelBench: A Benchmark for Long-Running Monitoring Agents

翻译：暂无翻译

Matheus Kunzler Maldaner,Adam Fourney,Amanda Swearngin,Hussein Mozannar,Gagan Bansal,Maya Murad,Rafah Hosn,Saleema Amershi

from arxiv, 18 pages, 16 figures

AI agents are increasingly asked to carry out work that spans minutes, hours, or longer. Yet the default model of agent behavior is continuous action: issuing tool calls, refreshing pages, searching for alternatives, or otherwise trying to force progress. This is the wrong approach for many long-running tasks, which are better served by a strategy of sustained attention. Instead, agents should monitor an environment, notice when an external event makes progress possible, then respond promptly without wasting resources while waiting. To measure progress on this class of tasks, we introduce SentinelBench, an open-source benchmark for time-evolving monitoring tasks. SentinelBench contains 100 tasks across 10 synthetic web environments, including email, calendars, finance, professional networking, and entertainment. Each environment exposes a live web interface and replays a scripted sequence of events, requiring agents to navigate and reason about web pages whose state shifts underfoot. SentinelBench measures task completion, reaction time, and resource use, exposing the tradeoff between responsiveness and cost. We report results across three models and two browser-agent harnesses, establishing performance baselines for future comparison and demonstrating how agent design choices can dramatically impact key metrics. Together, these results show that SentinelBench distinguishes meaningful differences in agent behavior.

翻译：暂无翻译

0

相关内容

Agent

2026 年 Agentic AI 工程师完全指南：一份系统化的学习路线图

2026 年 Agentic AI 工程师完全指南：一份系统化的学习路线图

专知会员服务

48+阅读 · 4月14日

Agent有望定义万亿劳动力市场

Agent有望定义万亿劳动力市场

专知会员服务

19+阅读 · 2025年6月11日

AI Agent深度（二）：2025 Agent元年，AI从L2向L3发展

AI Agent深度（二）：2025 Agent元年，AI从L2向L3发展

专知会员服务

45+阅读 · 2025年5月5日

AI行业专题报告：工具生态逐步完善，通用Agent曙光已现

AI行业专题报告：工具生态逐步完善，通用Agent曙光已现

专知会员服务

33+阅读 · 2025年3月27日

中国AI Agent行业研究报告（二）

中国AI Agent行业研究报告（二）

专知会员服务

48+阅读 · 2025年3月13日

人工智能专题报告：Operator和Manus打开AI Agent时代

人工智能专题报告：Operator和Manus打开AI Agent时代

专知会员服务

64+阅读 · 2025年3月12日

【AI Agent行业深度】框架、应用方向、应用领域及相关公司一文深度梳理！（附下载）

【AI Agent行业深度】框架、应用方向、应用领域及相关公司一文深度梳理！（附下载）

专知会员服务

144+阅读 · 2024年1月1日

AI Agent，大模型时代重要落地方向, 42页ppt

AI Agent，大模型时代重要落地方向, 42页ppt

专知会员服务

291+阅读 · 2023年10月12日

AI Agent下一个热点？复旦最新86页《大型语言模型智能体的崛起与潜力》综述，详述LLM Agent: 大脑、感知和行动

AI Agent下一个热点？复旦最新86页《大型语言模型智能体的崛起与潜力》综述，详述LLM Agent: 大脑、感知和行动

专知会员服务

170+阅读 · 2023年9月15日

AI Agent：基于大模型的自主智能体

AI Agent：基于大模型的自主智能体

专知会员服务

250+阅读 · 2023年9月9日

【泡泡图灵智库】DenseFusion:基于迭代密集融合的6D目标姿态估计

【泡泡图灵智库】DenseFusion:基于迭代密集融合的6D目标姿态估计

泡泡机器人SLAM

16+阅读 · 2019年9月3日

【泡泡图灵智库】Detect-SLAM：目标检测和SLAM相互收益

【泡泡图灵智库】Detect-SLAM：目标检测和SLAM相互收益

泡泡机器人SLAM

14+阅读 · 2019年6月28日

用PyTorch做物体检测和追踪

用PyTorch做物体检测和追踪

AI研习社

12+阅读 · 2019年1月6日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

腾讯：机器学习构建通用的数据异常检测平台

腾讯：机器学习构建通用的数据异常检测平台

全球人工智能

11+阅读 · 2018年5月1日

讲透RCNN, Fast-RCNN, Faster-RCNN，将CNN用于目标检测

讲透RCNN, Fast-RCNN, Faster-RCNN，将CNN用于目标检测

数据挖掘入门与实战

18+阅读 · 2018年4月20日

Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection

统计学习与视觉计算组

12+阅读 · 2018年3月15日

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

机器学习研究会

11+阅读 · 2017年12月5日

原创 | Attention Modeling for Targeted Sentiment

原创 | Attention Modeling for Targeted Sentiment

黑龙江大学自然语言处理实验室

25+阅读 · 2017年11月5日

【推荐】RNN/LSTM时序预测

【推荐】RNN/LSTM时序预测

机器学习研究会

25+阅读 · 2017年9月8日

基于子模优化的远程预警传感器管理研究

国家自然科学基金

2+阅读 · 2015年12月31日

基于管制通话语音个体特征的管制员不良工作状态识别方法研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于多源证据的繁忙水域交管雷达异常目标识别方法研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

47+阅读 · 2015年12月31日

长寿命空间机械臂在轨故障诊断、容错和预测策略研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于动态增益非线性干扰观测器的多智能体系统协调跟踪和干扰抑制

国家自然科学基金

1+阅读 · 2015年12月31日

基于图像属性和深度学习的大规模物体检测研究与应用

国家自然科学基金

1+阅读 · 2015年12月31日

物联网关键技术RFID系统安全测试的仿真架构.评估模型和受攻击模式的研究和实践

国家自然科学基金

2+阅读 · 2014年12月31日

面向人与Agent混合的多团队协作仿真训练方法研究

国家自然科学基金

19+阅读 · 2012年12月31日

基于群体智能的多Agent协作模型与适应性研究

国家自然科学基金

18+阅读 · 2009年12月31日

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Arxiv

0+阅读 · 6月17日

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

Arxiv

0+阅读 · 6月17日

Dissecting model behavior through agent trajectories

Arxiv

0+阅读 · 6月17日

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Arxiv

0+阅读 · 6月17日

When Agent Markets Arrive

Arxiv

0+阅读 · 5月28日

Anytime Detection of Strategic Deviations in Multi-Agent Systems

Arxiv

0+阅读 · 5月22日

Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

Arxiv

0+阅读 · 5月21日

Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents

Arxiv

0+阅读 · 5月13日

Agentic Performance at the Edge: Insights from Benchmarking

Arxiv

0+阅读 · 5月11日

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

Arxiv

0+阅读 · 5月6日

VIP会员

文章信息

相关主题

最新内容

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

专知会员服务

3+阅读 · 6月18日

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

专知会员服务

4+阅读 · 6月18日

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

专知会员服务

9+阅读 · 6月18日

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

专知会员服务

7+阅读 · 6月18日

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

专知会员服务

4+阅读 · 6月17日

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

专知会员服务

6+阅读 · 6月17日

学习数据的几何：形状空间分析数学综述

学习数据的几何：形状空间分析数学综述

专知会员服务

6+阅读 · 6月17日

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

专知会员服务

8+阅读 · 6月17日

定向能反无人机系统最新发展动态

定向能反无人机系统最新发展动态

专知会员服务

7+阅读 · 6月17日

从燃煤战舰到算法战争：水面指挥的永恒要求

从燃煤战舰到算法战争：水面指挥的永恒要求

专知会员服务

4+阅读 · 6月17日

《短程弹道再入飞行器拦截时间中的一项异常现象》

《短程弹道再入飞行器拦截时间中的一项异常现象》

专知会员服务

6+阅读 · 6月17日

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

专知会员服务

7+阅读 · 6月17日

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

专知会员服务

5+阅读 · 6月17日

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

专知会员服务

5+阅读 · 6月17日

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

专知会员服务

6+阅读 · 6月16日

相关VIP内容

2026 年 Agentic AI 工程师完全指南：一份系统化的学习路线图

2026 年 Agentic AI 工程师完全指南：一份系统化的学习路线图

专知会员服务

48+阅读 · 4月14日

Agent有望定义万亿劳动力市场

Agent有望定义万亿劳动力市场

专知会员服务

19+阅读 · 2025年6月11日

AI Agent深度（二）：2025 Agent元年，AI从L2向L3发展

AI Agent深度（二）：2025 Agent元年，AI从L2向L3发展

专知会员服务

45+阅读 · 2025年5月5日

AI行业专题报告：工具生态逐步完善，通用Agent曙光已现

AI行业专题报告：工具生态逐步完善，通用Agent曙光已现

专知会员服务

33+阅读 · 2025年3月27日

中国AI Agent行业研究报告（二）

中国AI Agent行业研究报告（二）

专知会员服务

48+阅读 · 2025年3月13日

人工智能专题报告：Operator和Manus打开AI Agent时代

人工智能专题报告：Operator和Manus打开AI Agent时代

专知会员服务

64+阅读 · 2025年3月12日

【AI Agent行业深度】框架、应用方向、应用领域及相关公司一文深度梳理！（附下载）

【AI Agent行业深度】框架、应用方向、应用领域及相关公司一文深度梳理！（附下载）

专知会员服务

144+阅读 · 2024年1月1日

AI Agent，大模型时代重要落地方向, 42页ppt

AI Agent，大模型时代重要落地方向, 42页ppt

专知会员服务

291+阅读 · 2023年10月12日

AI Agent下一个热点？复旦最新86页《大型语言模型智能体的崛起与潜力》综述，详述LLM Agent: 大脑、感知和行动

AI Agent下一个热点？复旦最新86页《大型语言模型智能体的崛起与潜力》综述，详述LLM Agent: 大脑、感知和行动

专知会员服务

170+阅读 · 2023年9月15日

AI Agent：基于大模型的自主智能体

AI Agent：基于大模型的自主智能体

专知会员服务

250+阅读 · 2023年9月9日

热门VIP内容

开通专知VIP会员享更多权益服务

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

相关资讯

【泡泡图灵智库】DenseFusion:基于迭代密集融合的6D目标姿态估计

【泡泡图灵智库】DenseFusion:基于迭代密集融合的6D目标姿态估计

泡泡机器人SLAM

16+阅读 · 2019年9月3日

【泡泡图灵智库】Detect-SLAM：目标检测和SLAM相互收益

【泡泡图灵智库】Detect-SLAM：目标检测和SLAM相互收益

泡泡机器人SLAM

14+阅读 · 2019年6月28日

用PyTorch做物体检测和追踪

用PyTorch做物体检测和追踪

AI研习社

12+阅读 · 2019年1月6日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

腾讯：机器学习构建通用的数据异常检测平台

腾讯：机器学习构建通用的数据异常检测平台

全球人工智能

11+阅读 · 2018年5月1日

讲透RCNN, Fast-RCNN, Faster-RCNN，将CNN用于目标检测

讲透RCNN, Fast-RCNN, Faster-RCNN，将CNN用于目标检测

数据挖掘入门与实战

18+阅读 · 2018年4月20日

Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection

统计学习与视觉计算组

12+阅读 · 2018年3月15日

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

机器学习研究会

11+阅读 · 2017年12月5日

原创 | Attention Modeling for Targeted Sentiment

原创 | Attention Modeling for Targeted Sentiment

黑龙江大学自然语言处理实验室

25+阅读 · 2017年11月5日

【推荐】RNN/LSTM时序预测

【推荐】RNN/LSTM时序预测

机器学习研究会

25+阅读 · 2017年9月8日

相关论文

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Arxiv

0+阅读 · 6月17日

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

Arxiv

0+阅读 · 6月17日

Dissecting model behavior through agent trajectories

Arxiv

0+阅读 · 6月17日

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Arxiv

0+阅读 · 6月17日

When Agent Markets Arrive

Arxiv

0+阅读 · 5月28日

Anytime Detection of Strategic Deviations in Multi-Agent Systems

Arxiv

0+阅读 · 5月22日

Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

Arxiv

0+阅读 · 5月21日

Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents

Arxiv

0+阅读 · 5月13日

Agentic Performance at the Edge: Insights from Benchmarking

Arxiv

0+阅读 · 5月11日

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

Arxiv

0+阅读 · 5月6日

相关基金

基于子模优化的远程预警传感器管理研究

国家自然科学基金

2+阅读 · 2015年12月31日

基于管制通话语音个体特征的管制员不良工作状态识别方法研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于多源证据的繁忙水域交管雷达异常目标识别方法研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

47+阅读 · 2015年12月31日

长寿命空间机械臂在轨故障诊断、容错和预测策略研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于动态增益非线性干扰观测器的多智能体系统协调跟踪和干扰抑制

国家自然科学基金

1+阅读 · 2015年12月31日

基于图像属性和深度学习的大规模物体检测研究与应用

国家自然科学基金

1+阅读 · 2015年12月31日

物联网关键技术RFID系统安全测试的仿真架构.评估模型和受攻击模式的研究和实践

国家自然科学基金

2+阅读 · 2014年12月31日

面向人与Agent混合的多团队协作仿真训练方法研究

国家自然科学基金

19+阅读 · 2012年12月31日

基于群体智能的多Agent协作模型与适应性研究

国家自然科学基金

18+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员