In recent years, with the rapid advancement of large language models (LLMs), strong empathetic response capability has become a crucial requirement. Consequently, managing and understanding large-scale empathetic dialogue datasets has gained increasing importance. However, models are typically trained on empathetic data without any quality selection, leading to inefficient data usage and wasted computational resources. Additionally, training on raw data can result in low performance in empathetic dialogues. In this work, we present Efficient-Empathy, a data selection algorithm based on sensibility and rationality scores that automatically selects sensibility and rationality data while discarding low-quality data. With only the sensibility data (59% of the full dataset), our trained sensibility model efficiently achieves state-of-the-art (SoTA) performance. Furthermore, the sensibility model maintains SoTA performance across multiple settings of the data selection hyperparameters, showcasing the robustness of our method. By integrating sensibility and rationality data with a mixture-of-experts (MoE) structure, we achieve even higher performance, demonstrating the effectiveness of our Efficient-Empathy algorithm.
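The score-based selection described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the sensibility and rationality scores are computed or how they are thresholded, so everything below is an assumption made for illustration: the `Sample` fields, the `select` function, the score range of [0, 1], the single `threshold` hyperparameter, and the rule of assigning each sample to the bucket with the higher score.

```python
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    """A dialogue sample with two hypothetical, precomputed scores in [0, 1]."""
    text: str
    sensibility: float
    rationality: float


def select(
    samples: List[Sample], threshold: float = 0.5
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Split samples into sensibility, rationality, and discarded buckets.

    Assumed rule (not taken from the paper): a sample is discarded when
    both scores fall below `threshold`; otherwise it goes to the bucket
    of its higher score, with ties going to the rationality bucket.
    """
    sensibility_data: List[Sample] = []
    rationality_data: List[Sample] = []
    discarded: List[Sample] = []
    for s in samples:
        if max(s.sensibility, s.rationality) < threshold:
            discarded.append(s)
        elif s.sensibility > s.rationality:
            sensibility_data.append(s)
        else:
            rationality_data.append(s)
    return sensibility_data, rationality_data, discarded
```

Under these assumptions, varying `threshold` plays the role of a data selection hyperparameter: the sensibility bucket alone would be used to train a sensibility model, and both buckets could feed separate experts in an MoE-style setup.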