EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

Zeyao Du, Tong Li, Yanci Zhang, Haibo Zhang

As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable, with every answer fixed in code before any text is generated and every sample validated. Our evaluation of seven models reveals that even the strongest attains only 57.1% overall accuracy, and rubric satisfaction degrades from visible to hidden sources. Overall, we believe EComAgentBench will serve as a reproducible foundation for moving shopping agents from single-query search toward dependable assistance over long horizons.
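To make the grading idea concrete, here is a minimal sketch of how source-tagged rubrics could attribute a failure to a requirement and its source. It is not the authors' implementation. The `Rubric` class, the `grade` function, the rubric type labels, the product fields, and the rule that a task succeeds only when every rubric passes within the budget are all assumptions made for illustration; only the three sources (query, profile, clarification) and the 100-tool-call limit come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

MAX_TOOL_CALLS = 100  # budget stated in the abstract


@dataclass(frozen=True)
class Rubric:
    """One requirement, tagged with a type and the source that reveals it (hypothetical schema)."""
    requirement: str
    kind: str    # assumed type label, e.g. "attribute" or "review"
    source: str  # "query", "profile", or "clarification"
    check: Callable[[dict], bool]


def grade(product: dict, rubrics: List[Rubric], tool_calls: int) -> dict:
    """Grade the chosen product and attribute each failed rubric to its source."""
    results: List[Tuple[Rubric, bool]] = [(r, bool(r.check(product))) for r in rubrics]

    # Group pass/fail outcomes by the source each requirement came from.
    by_source: Dict[str, List[bool]] = defaultdict(list)
    for rubric, ok in results:
        by_source[rubric.source].append(ok)

    return {
        # Assumed success rule: every rubric satisfied and the budget respected.
        "success": tool_calls <= MAX_TOOL_CALLS and all(ok for _, ok in results),
        "failures": [(r.requirement, r.source) for r, ok in results if not ok],
        "satisfaction_by_source": {s: sum(v) / len(v) for s, v in by_source.items()},
    }


# Illustrative task: one requirement per source (all values are made up).
rubrics = [
    Rubric("price under 50", "attribute", "query", lambda p: p["price"] < 50),
    Rubric("fragrance-free", "attribute", "profile", lambda p: p["fragrance_free"]),
    Rubric("reviews mention durability", "review", "clarification",
           lambda p: "durable" in p["review_text"]),
]
product = {"price": 39.99, "fragrance_free": False, "review_text": "very durable"}

report = grade(product, rubrics, tool_calls=42)
print(report["failures"])  # [('fragrance-free', 'profile')]
```

In this toy run the agent satisfies the visible query requirement but misses the one stored in the profile, so the report points at the profile source rather than only marking the final choice wrong.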
