The proliferation of e-commerce has made web shopping platforms key gateways for customers navigating the vast digital marketplace. Yet this rapid expansion has led to a noisy and fragmented information environment, increasing the cognitive burden on shoppers as they explore and purchase products online. Agentic systems, which promise to alleviate this burden, have garnered growing attention for automating user-side tasks in web shopping. Despite significant advances, existing benchmarks fail to comprehensively evaluate how well agentic systems can curate products in open-web settings. Specifically, they cover a limited range of shopping scenarios, focusing on simplified single-platform lookups rather than exploratory search. Moreover, they overlook personalization in evaluation, leaving it unclear whether agents can adapt to diverse user preferences in realistic shopping contexts. To address this gap, we present AgenticShop, the first benchmark for evaluating agentic systems on personalized product curation in open-web environments. Our approach features realistic shopping scenarios, diverse user profiles, and a verifiable, checklist-driven personalization evaluation framework. Through extensive experiments, we demonstrate that current agentic systems remain largely insufficient, underscoring the need for user-side systems that effectively curate tailored products across the modern web.