Emotional support conversations require more than fluent responses: a supporter must understand the seeker's situation and emotions, adopt an appropriate strategy, and respond in a natural, human-like manner. Despite advances in large language models, current systems often lack structured, psychology-informed reasoning. Improving them through reinforcement learning is also difficult because reward signals are unreliable, and reinforcement fine-tuning can amplify repetitive response patterns. We propose structured empathetic reasoning, which decomposes support into three steps carried out before the final reply is generated: conversation history analysis, multimodal emotional state inference, and strategy selection. To implement this, we introduce SER, a fine-grained dataset with step-level correctness labels and pairwise response preferences. We then present PEER, which uses GRPO with UnifiReward, a unified process-outcome reward model that evaluates both reasoning steps and final responses in multi-turn interactions. To reduce repetition, we augment the data with personality-based rewriting and down-weight redundant outputs. Comprehensive experiments show improved empathy, strategy alignment, and human-likeness without sacrificing diversity.
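As a rough illustration of how a process-outcome reward and redundancy down-weighting could feed a GRPO-style group-relative advantage, consider the minimal sketch below. It is not the PEER implementation: the `Rollout` container, the equal weighting of process and outcome scores, the token-level Jaccard similarity, the 0.8 duplicate threshold, and the `1 / (1 + duplicates)` down-weighting rule are all hypothetical choices made only for this example. In practice the step and outcome scores would come from a learned reward model such as UnifiReward rather than being supplied by hand.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Rollout:
    """One sampled reply for a prompt (hypothetical container)."""
    step_scores: List[float]  # one score per reasoning step: history, emotion, strategy
    outcome_score: float      # score for the final response
    response: str


def combined_reward(r: Rollout, process_weight: float = 0.5) -> float:
    """Blend the mean step score with the outcome score (assumed weighting)."""
    process = mean(r.step_scores)
    return process_weight * process + (1.0 - process_weight) * r.outcome_score


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, used here as a crude redundancy signal."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def redundancy_weight(idx: int, responses: List[str], threshold: float = 0.8) -> float:
    """Shrink the reward of a reply that near-duplicates others in its group."""
    duplicates = sum(
        1
        for j, other in enumerate(responses)
        if j != idx and jaccard(responses[idx], other) >= threshold
    )
    return 1.0 / (1 + duplicates)


def group_relative_advantages(rollouts: List[Rollout], eps: float = 1e-8) -> List[float]:
    """Normalize down-weighted rewards within the group, as in GRPO."""
    responses = [r.response for r in rollouts]
    rewards = [
        combined_reward(r) * redundancy_weight(i, responses)
        for i, r in enumerate(rollouts)
    ]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]
```

Under these assumptions, two identical high-scoring replies in one group each keep only half of their reward before normalization, so a distinct reply of similar quality would receive a larger advantage than either duplicate.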