Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications: unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities that previous e-commerce and general-purpose benchmarks do not test. We introduce the Shopping Reasoning Bench, a benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10,863 importance-weighted binary rubric criteria authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that overall pass rates reach only 57--77%. On multi-turn missions, all models score 13--29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades by 4--18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making the Shopping Reasoning Bench a challenging testbed for future shopping assistant development.