We present ShoppingComp, a challenging real-world benchmark for comprehensively evaluating LLM-powered shopping agents on three core capabilities: precise product retrieval, expert-level report generation, and safety-critical decision making. Unlike prior e-commerce benchmarks, ShoppingComp introduces difficult product discovery queries with many constraints, while guaranteeing open-world products and enabling easy verification of agent outputs. The benchmark comprises 145 instances and 558 scenarios, curated by 35 experts to reflect authentic shopping needs. Results reveal stark limitations of current LLMs: even state-of-the-art models achieve low performance (e.g., 17.76% for GPT-5.2 and 15.82% for Gemini-3-Pro). Error analysis traces these failures to limitations in core agent competencies, including information grounding in open-world environments, reliable verification of multi-constraint requirements, consistent reasoning over noisy and conflicting evidence, and risk-aware decision making. By exposing these capability gaps, ShoppingComp characterizes the trust threshold that AI systems must cross before they can be proactively trusted for reliable real-world decision making. Our code and dataset are available at https://github.com/ByteDance-BandAI/ShoppingComp.