Despite recent advances in open-domain dialogue systems, building a reliable evaluation metric remains a challenging problem. Recent studies have proposed learnable metrics based on classification models trained to distinguish the correct response. However, neural classifiers are known to make overly confident predictions for examples from unseen distributions. We propose DEnsity, which evaluates a response by applying density estimation to the feature space derived from a neural classifier. Our metric measures how likely a response is to appear in the distribution of human conversations. Moreover, to improve the performance of DEnsity, we use contrastive learning to further compress the feature space. Experiments on multiple response evaluation datasets show that DEnsity correlates better with human evaluations than existing metrics. Our code is available at https://github.com/ddehun/DEnsity.
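To make the idea of scoring a response by density in a feature space concrete, the following is a minimal sketch, not the released implementation. It assumes a single Gaussian as the density estimator and uses the negative squared Mahalanobis distance as the score; the abstract does not specify the estimator, so that choice is an assumption. The names `fit_gaussian` and `density_score` are hypothetical, and the random vectors stand in for features that would come from a trained neural classifier.

```python
import numpy as np


def fit_gaussian(features, eps=1e-6):
    """Estimate the mean and a regularized precision matrix of reference features."""
    mean = features.mean(axis=0)
    centered = features - mean
    cov = centered.T @ centered / len(features)
    # Small ridge term keeps the covariance invertible (assumed hyperparameter).
    cov += eps * np.eye(features.shape[1])
    precision = np.linalg.inv(cov)
    return mean, precision


def density_score(feature, mean, precision):
    """Negative squared Mahalanobis distance: higher means closer to the reference data."""
    diff = feature - mean
    return -float(diff @ precision @ diff)


# Stand-in for classifier features of human responses (synthetic, for illustration only).
rng = np.random.default_rng(0)
human_features = rng.normal(0.0, 1.0, size=(500, 8))
mean, precision = fit_gaussian(human_features)

# A feature vector near the reference data versus one far away from it.
typical = np.zeros(8)
atypical = np.full(8, 5.0)
score_typical = density_score(typical, mean, precision)
score_atypical = density_score(atypical, mean, precision)
```

Under these assumptions, a response whose features lie far from those of human conversations receives a lower score than one that lies close to them, regardless of how confident the classifier itself would be.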