Empathy is critical for effective and satisfactory conversational communication. Prior efforts to measure conversational empathy mostly focus on expressed communicative intents, that is, the way empathy is expressed. Yet these works overlook the fact that conversation is also a collaboration involving both speakers and listeners. In contrast, we propose a multi-dimensional empathy evaluation framework that measures both expressed intents from the speaker's perspective and perceived empathy from the listener's perspective. We apply the framework to analyze our internal customer-service dialogues. We find that the two dimensions (expressed intent types and perceived empathy) are interconnected, and that perceived empathy correlates highly with dialogue satisfaction levels. To reduce annotation cost, we explore two options for automatically measuring conversational empathy: prompting LLMs and training language-model-based classifiers. Our experiments show that prompting methods perform relatively poorly on both public datasets and our internal dataset, even with popular models such as GPT-4 and the Flan family. In contrast, instruction-finetuned classifiers based on Flan-T5 family models outperform prior work and competitive baselines. We conduct a detailed ablation study to provide more insight into the strong performance of the instruction-finetuning method.