EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management

Emotional intelligence (EI) in Large Language Models (LLMs) is often evaluated through static understanding tasks or single-response dialogue generation. However, emotion management is interactive: a good model should not only recognize a user's emotion, but also improve the user's emotional and relational state over several turns. We introduce EIBench, a simulator-based benchmark for interactive emotion management. EIBench contains 2,222 scenarios, with 2,009 for training and 213 for held-out testing. The scenarios are organized by a 2x2 taxonomy covering Support, Defense, Repair, and Charm, which together capture different forms of support, boundary maintenance, trust repair, and rapport building. In each scenario, an LLM simulator plays the user, updates an emotion-relation state after each turn, and maps the final state to an anchor-based score. This design makes EIBench both an evaluation benchmark and a training environment: the final state gives the outcome reward, while the per-turn state updates provide dense feedback for RL. We evaluate 15 open- and closed-source LLMs. Current models perform well on support and rapport-building scenes, but struggle with boundary maintenance under user pressure. To improve the EI ability of LLMs, we propose Centered Turn-Credit GRPO (CTC-GRPO), a GRPO extension that reuses the simulator's per-turn state updates as dense turn-level feedback while preserving the final outcome reward. CTC-GRPO improves Qwen3-8B from -22.4 to +22.4 on EIBench and also improves on out-of-distribution evaluations including SAGE (+12.4) and EQBench3 (+20.9%). Our results show that simulator-tracked user states can support both evaluation and training for multi-turn emotion management.
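The abstract does not give the CTC-GRPO formula, so the following is only a minimal sketch of one possible reading of "centered turn credit", not the authors' implementation. It assumes that each rollout in a GRPO group has a scalar outcome reward and a list of per-turn state changes reported by the simulator, that turn credits are centered within each rollout so they sum to zero, and that a hypothetical weight `beta` mixes them with the group-relative outcome advantage. All function and parameter names are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(outcome_rewards, eps=1e-8):
    """Standard GRPO-style normalization of outcome rewards within one group."""
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]


def centered_turn_credits(turn_deltas):
    """Subtract the rollout's mean per-turn change so credits sum to zero.

    Assumes a non-empty list of per-turn state changes.
    """
    mu = mean(turn_deltas)
    return [d - mu for d in turn_deltas]


def turn_level_advantages(outcome_rewards, per_turn_deltas, beta=0.5):
    """Combine the outcome advantage with centered per-turn credit.

    ASSUMPTION: the additive form and the weight `beta` are hypothetical;
    the abstract only says that per-turn updates are reused as dense feedback
    while the final outcome reward is preserved.
    """
    base = group_relative_advantages(outcome_rewards)
    advantages = []
    for outcome_adv, deltas in zip(base, per_turn_deltas):
        credits = centered_turn_credits(deltas)
        advantages.append([outcome_adv + beta * c for c in credits])
    return advantages


# Two rollouts of the same scenario, three turns each (made-up numbers).
outcome_rewards = [1.0, -1.0]
per_turn_deltas = [[0.2, 0.5, 0.3], [-0.4, 0.1, -0.7]]
turn_adv = turn_level_advantages(outcome_rewards, per_turn_deltas)
```

Under these assumptions, centering means the average advantage over a rollout's turns equals its outcome advantage, so the dense signal redistributes credit among turns without changing which rollouts are favored overall.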
