Large Language Models (LLMs) have recently demonstrated strong potential for generating believable, human-like behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulating real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline. The project page is available at https://damon-demon.github.io/shop-r1.html.
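To make the hierarchical reward structure concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation; the function names, the 0.5/0.5 split between coarse and fine credit, and the per-action-type difficulty weights are all illustrative assumptions). It awards coarse credit for predicting the correct high-level action type, fine-grained partial credit for matching sub-action attributes and values, and scales the total by a difficulty weight so harder actions earn proportionally more reward:

```python
# Illustrative sketch of a hierarchical, difficulty-aware action reward.
# All structure below (dict schema, 0.5/0.5 split, weight table) is assumed,
# not taken from the Shop-R1 paper.

def hierarchical_reward(pred, gold, difficulty_weights):
    """Score a predicted action against the ground-truth action.

    pred, gold: dicts with a high-level "type" and a fine-grained
        "attrs" mapping (attribute -> value).
    difficulty_weights: maps action type to a scaling factor, so rarer or
        harder action types (e.g. "purchase") earn proportionally more.
    """
    scale = difficulty_weights.get(gold["type"], 1.0)

    # Coarse level: no credit at all if the high-level action type is wrong.
    # Gating fine-grained credit on type correctness discourages reward
    # hacking via attribute matches on the wrong action.
    if pred.get("type") != gold["type"]:
        return 0.0
    reward = 0.5 * scale

    # Fine-grained level: partial credit per matching attribute/value pair.
    gold_attrs = gold.get("attrs", {})
    if gold_attrs:
        pred_attrs = pred.get("attrs", {})
        matches = sum(1 for k, v in gold_attrs.items()
                      if pred_attrs.get(k) == v)
        reward += 0.5 * scale * matches / len(gold_attrs)
    else:
        reward += 0.5 * scale  # type-only actions get full credit on type

    return reward


weights = {"click": 1.0, "purchase": 2.0}  # assumed difficulty table
gold = {"type": "purchase", "attrs": {"item": "mug", "qty": "2"}}
exact = hierarchical_reward(gold, gold, weights)            # full credit, scaled
partial = hierarchical_reward(
    {"type": "purchase", "attrs": {"item": "mug"}}, gold, weights)
wrong = hierarchical_reward({"type": "click", "attrs": {}}, gold, weights)
```

In this sketch, an exact purchase prediction scores 2.0 (full credit times the 2.0 difficulty weight), matching only one of two attributes scores 1.5, and a wrong action type scores 0 regardless of attribute overlap.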