In recent years, with the rapid advancement of large language models (LLMs), strong empathetic response capability has become a crucial prerequisite for their real-world deployment. Consequently, managing and understanding empathetic datasets has gained increasing significance. However, empathetic data are typically human-annotated, yielding datasets that are both limited in scale and costly in human labor. In this work, we present Synth-Empathy, an LLM-based pipeline that automatically generates empathetic data and then applies quality and diversity selection, retaining high-quality samples while discarding low-quality ones. Using data generated by a model with weak empathetic capability, we further improve empathetic response performance and achieve state-of-the-art (SoTA) results across multiple benchmarks. Moreover, our model attains SoTA performance on various human evaluation benchmarks, demonstrating its effectiveness and robustness in real-world applications. Finally, we analyze the trade-off between data quantity and quality, offering insights into empathetic data generation and selection.