Conversational shopping agents represent a critical consumer-facing application of Large Language Model (LLM)-powered agents, yet how to effectively optimize such agents with post-training Reinforcement Learning (RL) remains underexplored. This work investigates RL-based optimization for shopping agents in real-world scenarios, where agents must simultaneously satisfy multiple interdependent objectives spanning objective metrics (product correctness), subjective qualities (persuasiveness), outcome rewards (final response quality), and process rewards (tool efficiency). We present a complete methodology to address this challenge. Specifically, we first construct SmartShopBench, a benchmark that captures diverse shopping intents and provides a hierarchical evaluation that decomposes complex quality requirements into measurable levels. Building on this evaluation framework, we design Hierarchical Reward Modeling (HRM) to structure mixed reward types through conditional gating that reflects their logical dependencies. To enable efficient training, we further propose Dynamic Contrastive Policy Optimization (DCPO), which balances response quality with operational efficiency through dynamic trajectory selection based on reward and reasoning length. Extensive experiments demonstrate that our RL-trained agent, ChatShopBuddy, consistently outperforms larger models that rely on generic reasoning, achieving superior stability rather than merely higher peak performance. Our work provides valuable guidance for applying RL to real-world conversational agents.
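To make the two mechanisms named above concrete, the sketch below illustrates one way the conditional gating of HRM and the reward-and-length-based trajectory selection of DCPO could fit together. It is a minimal illustration under assumed components: the `Trajectory` fields, the `hierarchical_reward` weights, and the length-penalty coefficient in `dcpo_select` are all hypothetical placeholders, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One sampled agent rollout with its scored reward components (illustrative)."""
    product_correct: bool    # objective metric: did the agent surface the right product?
    tool_efficiency: float   # process reward in [0, 1], e.g. penalizing redundant tool calls
    response_quality: float  # outcome reward in [0, 1] from a response-quality judge
    persuasiveness: float    # subjective quality in [0, 1] from a rater
    reasoning_length: int    # number of reasoning tokens in the trajectory

def hierarchical_reward(t: Trajectory) -> float:
    """Conditional gating: higher-level outcome and subjective rewards only count
    once the lower-level objective requirement is met, reflecting the dependency
    that a persuasive answer about the wrong product is worthless.
    The 0.4 / 0.2 weights are placeholders for illustration."""
    if not t.product_correct:  # gate: wrong product zeroes everything above it
        return 0.0
    return (0.4                         # base credit for objective correctness
            + 0.2 * t.tool_efficiency   # process reward
            + 0.2 * t.response_quality  # outcome reward
            + 0.2 * t.persuasiveness)   # subjective quality

def dcpo_select(group: list[Trajectory], top_k: int = 2) -> list[Trajectory]:
    """Dynamic contrastive selection: among trajectories sampled for the same prompt,
    prefer high reward while breaking near-ties toward shorter reasoning, so the
    selected positives trade response quality against operational efficiency."""
    max_len = max(t.reasoning_length for t in group) or 1
    def score(t: Trajectory) -> float:
        length_penalty = 0.1 * (t.reasoning_length / max_len)  # illustrative coefficient
        return hierarchical_reward(t) - length_penalty
    return sorted(group, key=score, reverse=True)[:top_k]
```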