Existing reference-free turn-level evaluation metrics for chatbots do not adequately capture the interaction between the user and the system. Consequently, they often correlate poorly with human evaluations. To address this issue, we propose a novel model-agnostic approach that leverages Conditional Pointwise Mutual Information (C-PMI) to measure the turn-level interaction between the system and the user with respect to a given evaluation dimension. Experimental results on the widely used FED dialogue evaluation dataset demonstrate that our approach significantly improves the correlation with human judgment compared with existing evaluation systems. By replacing the negative log-likelihood-based scorer with our proposed C-PMI scorer, we achieve, on average, a 62.6% relative improvement in Spearman correlation for the FED evaluation metric. Our code is publicly available at https://github.com/renll/C-PMI.
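To make the contrast between the two scorers concrete, the following minimal sketch uses the textbook definition of conditional pointwise mutual information, PMI(h; r | c) = log P(h | c, r) - log P(h | c), where c is the dialogue context, r the system response, and h a follow-up utterance. The abstract does not state the exact formulation, so this should be read as an illustration under that assumption rather than as the released implementation. The names `log_prob`, `hypothesis`, and `toy_log_prob` are hypothetical, and the probabilities in the toy model are invented for the example.

```python
import math
from typing import Callable

# Hypothetical interface: log_prob(context, target) returns log P(target | context)
# in nats. Any language model that can score text could be wrapped this way.
LogProbFn = Callable[[str, str], float]


def nll_score(log_prob: LogProbFn, context: str, response: str, hypothesis: str) -> float:
    """Negative log-likelihood of the hypothesis given context and response."""
    return -log_prob(context + "\n" + response, hypothesis)


def c_pmi_score(log_prob: LogProbFn, context: str, response: str, hypothesis: str) -> float:
    """Conditional PMI between hypothesis and response, given the context.

    Uses the identity PMI(h; r | c) = log P(h | c, r) - log P(h | c).
    """
    with_response = log_prob(context + "\n" + response, hypothesis)
    without_response = log_prob(context, hypothesis)
    return with_response - without_response


def toy_log_prob(context: str, target: str) -> float:
    """Stand-in for a real model; the probabilities are made up for illustration."""
    p = 0.5 if "pasta" in context else 0.125
    return math.log(p)


if __name__ == "__main__":
    context = "User: What should I cook tonight?"
    response = "System: How about pasta with tomato sauce?"
    hypothesis = "User: That sounds great!"
    print("NLL score:  ", nll_score(toy_log_prob, context, response, hypothesis))
    print("C-PMI score:", c_pmi_score(toy_log_prob, context, response, hypothesis))
```

Under this definition the score is positive when the response makes the follow-up more likely than the context alone does, and zero when the response carries no information about it. The likelihood-only score cannot make that distinction, because it never compares against the response-free baseline.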