Existing reference-free, turn-level evaluation metrics for chatbots do not adequately capture the interaction between the user and the system, and as a result they often correlate poorly with human evaluations. To address this issue, we propose a novel model-agnostic approach that leverages Conditional Pointwise Mutual Information (C-PMI) to measure the turn-level interaction between the system and the user with respect to a given evaluation dimension. Experimental results on the widely used FED dialogue evaluation dataset demonstrate that our approach significantly improves correlation with human judgment compared with existing evaluation systems. Replacing the negative log-likelihood-based scorer of the FED evaluation metric with our proposed C-PMI scorer yields a 60.5% relative improvement in Spearman correlation on average. Our code is publicly available at https://github.com/renll/C-PMI.
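To make the quantity named in the abstract concrete, the following is a minimal sketch of conditional pointwise mutual information computed from log-probabilities, using the standard identity PMI(x; y | c) = log p(x | y, c) - log p(x | c). The mapping of variables to dialogue elements (x as a follow-up utterance, y as the system response, c as the dialogue history), the helper names, and the token probabilities are illustrative assumptions, not the scorer from the released code.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into a sequence log-probability."""
    return sum(token_logprobs)


def conditional_pmi(logp_x_given_y_c, logp_x_given_c):
    """Standard conditional PMI: log p(x | y, c) - log p(x | c).

    Assumed reading for dialogue evaluation (hypothetical, for illustration):
    x = a follow-up utterance, y = the system response, c = the dialogue history.
    """
    return logp_x_given_y_c - logp_x_given_c


# Made-up token probabilities standing in for language-model outputs.
# Log-probability of the follow-up given history and response:
with_response = sequence_logprob([math.log(0.5), math.log(0.4)])
# Log-probability of the same follow-up given the history alone:
without_response = sequence_logprob([math.log(0.2), math.log(0.1)])

score = conditional_pmi(with_response, without_response)
print(f"C-PMI score: {score:.4f}")
```

A positive score means the response makes the follow-up more likely than the history alone does, whereas a plain negative log-likelihood scorer only looks at the first term and ignores how much the response itself contributes.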