Empathetic response generation is a desirable aspect of conversational agents, crucial for facilitating engaging and emotionally intelligent multi-turn conversations between humans and machines. Leveraging large language models (LLMs) for this task has shown promising results, yet challenges persist in ensuring both the empathetic quality of the responses and the retention of the models' generalization performance. We propose a novel approach in which we construct theory-driven preference datasets based on emotion grounding and use them to align LLMs with preference optimization algorithms to address these challenges. To evaluate empathetic response generation, we employ the EmpatheticDialogues dataset, assessing empathy with the diff-Epitome and BERTscore metrics and with multi-dimensional human evaluation. Additionally, we measure diversity and emotional valence using feature-based methods. We also evaluate the impact of training on generalization performance using the MMLU benchmark and tasks from the Open LLM Leaderboard. The results show that LLMs can be aligned for empathetic response generation by preference optimization while retaining their general performance, and that emotion grounding can guide preference dataset creation. We make all datasets, source code, and models publicly available at https://github.com/justtherightsize/empo
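The emotion-grounded preference dataset described above can be illustrated with a minimal sketch. This is an illustrative assumption, not the paper's actual pipeline: the data layout, field names, and the heuristic of preferring candidates whose emotion matches the grounded label are all hypothetical, chosen to show the (prompt, chosen, rejected) triple format that DPO-style preference optimization trainers typically consume.

```python
def build_preference_pairs(dialogues):
    """Turn emotion-annotated candidate responses into (prompt, chosen,
    rejected) triples, preferring candidates whose predicted emotion
    matches the dialogue's grounded emotion label.

    Hypothetical sketch: field names and the matching heuristic are
    illustrative assumptions, not the paper's method."""
    pairs = []
    for d in dialogues:
        matching = [c for c in d["candidates"]
                    if c["emotion"] == d["grounded_emotion"]]
        mismatching = [c for c in d["candidates"]
                       if c["emotion"] != d["grounded_emotion"]]
        # Every (matching, mismatching) combination yields one pair.
        for chosen in matching:
            for rejected in mismatching:
                pairs.append({
                    "prompt": d["context"],
                    "chosen": chosen["text"],
                    "rejected": rejected["text"],
                })
    return pairs

# Toy example with one dialogue and two candidate responses.
dialogues = [{
    "context": "I just lost my job and feel terrible.",
    "grounded_emotion": "sadness",
    "candidates": [
        {"text": "I'm so sorry, that must be really hard.",
         "emotion": "sadness"},
        {"text": "Cool, more free time for you!",
         "emotion": "joy"},
    ],
}]
pairs = build_preference_pairs(dialogues)
```

The resulting dicts use the prompt/chosen/rejected keys commonly expected by preference-optimization trainers, so a dataset built this way can be fed to a DPO-style training loop.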