A growing body of literature critiques current operationalizations of empathy for relying on loose definitions of the construct. Such definitions negatively affect dataset quality, model robustness, and evaluation reliability. We propose an empathy evaluation framework that operationalizes empathy close to its psychological origins. The framework measures the variance in LLM responses to prompts using existing metrics for empathy and emotional valence. The variance is introduced through controlled prompt generation: we vary the social biases that affect context understanding and thereby empathetic understanding. This control over generation ensures high theoretical validity of the constructs in the prompt dataset. It also makes high-quality translation more manageable, especially into languages that currently have little to no means of evaluating empathy or bias, such as the Slavonic family. Using selected LLMs and various prompt types, including multiple-choice answers and free generation, we demonstrate empathy evaluation with the framework. The variance in our initial evaluation sample is small, and we were unable to measure convincing differences in empathetic understanding between contexts involving different social groups. The results are nevertheless promising, because the models showed significant alterations to the reasoning chains needed to capture the relatively subtle changes in the prompts. This provides a basis for future research into the construction of the evaluation sample and the statistical methods for measuring the results.
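The variance-based evaluation loop described above can be sketched as follows. This is a minimal illustration only, not the paper's actual implementation: `query_model`, `empathy_score`, the prompt template, and the social-group list are all hypothetical placeholders standing in for a real LLM call and an existing empathy metric.

```python
# Minimal sketch of the evaluation loop: generate prompt variants that differ
# only in the social group mentioned, score each model response with an
# empathy metric, and measure the variance of the scores across groups.
from statistics import pvariance

# Hypothetical prompt template; only the social group varies between prompts.
PROMPT_TEMPLATE = "A {group} person tells you they just lost their job. How do you respond?"
GROUPS = ["young", "elderly", "wealthy", "homeless"]

def query_model(prompt: str) -> str:
    # Placeholder for an actual LLM API call; returns a fixed response here.
    return "I'm so sorry to hear that. That must be really hard."

def empathy_score(response: str) -> float:
    # Placeholder for an existing empathy metric; here, a toy fraction of
    # empathy-cue words in the response.
    cues = {"sorry", "hear", "hard", "understand", "feel"}
    words = [w.strip(".,!?").lower() for w in response.split()]
    return sum(w in cues for w in words) / max(len(words), 1)

scores = [empathy_score(query_model(PROMPT_TEMPLATE.format(group=g))) for g in GROUPS]
variance = pvariance(scores)  # low variance => little bias-driven difference
```

With a real model, a near-zero variance (as in our initial sample) suggests the empathy metric does not register the bias-driven prompt changes, which motivates inspecting the reasoning chains directly.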