In e-commerce, LLM agents show promise for shopping tasks such as recommendation, budgeting, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. However, two challenges stand in the way: (1) no benchmark exists for evaluating long-term, preference-aware shopping tasks, and (2) existing designs treat preference identification and shopping assistance as separate components, which precludes end-to-end optimization. In this paper, we introduce a novel benchmark with a long-term memory setup, spanning two shopping tasks over 1.2 million real-world products, and propose Shopping Companion, a unified framework that jointly tackles memory retrieval and shopping assistance while supporting user intervention. To train these capabilities, we develop a dual-reward reinforcement learning strategy with tool-wise rewards to handle the sparse and discontinuous rewards inherent in multi-turn interactions. Experimental results show that even state-of-the-art models (such as GPT-5) achieve success rates under 70% on our benchmark, highlighting how challenging this domain remains. Notably, our lightweight LLM, trained with Shopping Companion, consistently outperforms strong baselines, achieving better preference capture and task performance, which validates the effectiveness of our unified design.