This paper introduces EmplifAI, a Japanese empathetic dialogue dataset designed to support patients coping with chronic medical conditions. Such patients often experience a wide range of positive and negative emotions (e.g., hope and despair) that shift across the stages of disease management. EmplifAI addresses this complexity by providing situation-based dialogues grounded in 28 fine-grained emotion categories, adapted and validated from the GoEmotions taxonomy. The dataset includes 280 medically contextualized situations and 4,125 two-turn dialogues, collected through crowdsourcing and expert review. To evaluate emotional alignment in empathetic dialogues, we assessed model predictions on situation-dialogue pairs using BERTScore across multiple large language models (LLMs), obtaining F1 scores of 0.83. Fine-tuning a baseline Japanese LLM (LLM-jp-3.1-13b-instruct4) on EmplifAI yielded notable improvements in fluency, general empathy, and emotion-specific empathy. Furthermore, to validate our evaluation pipeline, we compared the scores assigned by an LLM-as-a-Judge and by human raters to dialogues generated by multiple LLMs, and we discuss the insights and potential risks derived from the correlation analysis.
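For readers unfamiliar with the metric, the following is a minimal sketch of how a BERTScore-style F1 can be computed from token embeddings by greedy cosine matching. It is an illustration under stated assumptions, not the evaluation code used for EmplifAI: the function names `cosine` and `greedy_f1` are hypothetical, the two-dimensional vectors are toy stand-ins for the contextual embeddings that a pretrained model would supply, and refinements such as IDF weighting and baseline rescaling are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_f1(candidate_vecs, reference_vecs):
    """BERTScore-style F1 from per-token embeddings (toy sketch).

    Precision: each candidate token is matched to its most similar
    reference token. Recall: each reference token is matched to its
    most similar candidate token. F1 is their harmonic mean.
    """
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy embeddings (assumed values, not real model output).
reference = [[1.0, 0.0], [0.0, 1.0]]
full_match = [[1.0, 0.0], [0.0, 1.0]]
partial_match = [[1.0, 0.0]]

score_full = greedy_f1(full_match, reference)
score_partial = greedy_f1(partial_match, reference)
```

In this toy setting, a candidate that covers only one of the two reference tokens has perfect precision but a recall of one half, so its F1 falls below that of the full match.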