Current evaluation paradigms for emotional support conversations tend to reward generic empathetic responses, yet fail to assess whether the support is genuinely personalized to a user's unique psychological profile and contextual needs. We introduce EmoHarbor, an automated evaluation framework that adopts a User-as-a-Judge paradigm, simulating the user's inner world. EmoHarbor employs a Chain-of-Agent architecture that decomposes the user's internal processes into three specialized roles, enabling the agents to interact with supporters and complete assessments much as human users would. We instantiate the framework as a benchmark with 100 real-world user profiles spanning diverse personality traits and situations, and define 10 evaluation dimensions of personalized support quality. A comprehensive evaluation of 20 advanced LLMs on EmoHarbor reveals a critical insight: while these models excel at generating empathetic responses, they consistently fail to tailor their support to individual user contexts. This finding reframes the central challenge, shifting the research focus from merely enhancing generic empathy to developing truly user-aware emotional support. EmoHarbor thus provides a reproducible and scalable framework to guide the development and evaluation of more nuanced, user-aware emotional support systems.
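To make the Chain-of-Agent / User-as-a-Judge idea concrete, the following is a minimal Python sketch of the simulation loop. It is illustrative only: the role names (`perceive`, `appraise`, `respond`), class names, and toy judging logic are hypothetical placeholders, since the abstract does not specify the three roles or the 10 scoring dimensions.

```python
# Minimal sketch of a User-as-a-Judge loop with a chain of role agents.
# All role names and the toy "personalization" check are hypothetical
# placeholders, not the paper's actual design.
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    traits: str       # e.g., a short personality description
    situation: str    # the user's real-world predicament

@dataclass
class TurnState:
    supporter_reply: str
    notes: dict = field(default_factory=dict)  # each role records its output here

def perceive(state, profile):
    # Hypothetical role 1: register how the reply lands given the profile.
    return f"reply received; traits in play: {profile.traits}"

def appraise(state, profile):
    # Hypothetical role 2: crude check of whether the reply engages the
    # user's specific situation rather than offering generic comfort.
    keyword = profile.situation.split()[0].lower()
    return "feels personalized" if keyword in state.supporter_reply.lower() else "feels generic"

def respond(state, profile):
    # Hypothetical role 3: produce the next user utterance from the inner state.
    return f"({state.notes['appraise']}) Can we talk more about my situation?"

ROLES = [("perceive", perceive), ("appraise", appraise), ("respond", respond)]

def simulate(profile, supporter, n_turns=3):
    """Run the dialogue: chained role agents react to each supporter turn,
    and the last role's output doubles as the next user message."""
    transcript, user_msg = [], profile.situation
    for _ in range(n_turns):
        state = TurnState(supporter_reply=supporter(user_msg))
        for name, role in ROLES:  # Chain-of-Agent: roles run in sequence
            state.notes[name] = role(state, profile)
        user_msg = state.notes["respond"]
        transcript.append((state.supporter_reply, dict(state.notes)))
    return transcript

if __name__ == "__main__":
    profile = UserProfile(traits="introverted, high self-criticism",
                          situation="job loss and mounting self-doubt")
    echo_supporter = lambda msg: f"I hear you. That sounds hard: {msg}"
    for reply, notes in simulate(profile, echo_supporter):
        print(reply, "->", notes["appraise"])
```

In this sketch the per-turn `notes` stand in for the simulated inner world; a full implementation would back each role with an LLM prompt conditioned on the user profile and aggregate the per-turn judgments into scores along the benchmark's evaluation dimensions.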