Despite recent advances in open-domain dialogue systems, building a reliable evaluation metric remains a challenging problem. Recent studies have proposed learnable metrics based on classification models trained to distinguish the correct response. However, neural classifiers are known to make overly confident predictions on examples from unseen distributions. We propose DEnsity, which evaluates a response by applying density estimation to the feature space derived from a neural classifier. Our metric measures how likely a response is to appear in the distribution of human conversations. Moreover, to improve the performance of DEnsity, we use contrastive learning to further compress the feature space. Experiments on multiple response evaluation datasets show that DEnsity correlates better with human evaluations than existing metrics. Our code is available at https://github.com/ddehun/DEnsity.
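The idea of scoring a response by its density in a feature space can be illustrated with a minimal sketch. The abstract does not say which density estimator is used, so the example below assumes a single Gaussian fitted to features of human responses and uses the negative squared Mahalanobis distance as the score. The function names (`fit_gaussian`, `density_score`) are hypothetical, and random vectors stand in for features that would come from a neural classifier.

```python
import numpy as np

def fit_gaussian(features):
    """Fit a Gaussian to reference features of shape (n_samples, dim).

    Assumption: a single Gaussian stands in for the density estimator.
    """
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    # Small ridge term keeps the covariance matrix invertible.
    cov = cov + 1e-6 * np.eye(features.shape[1])
    return mean, np.linalg.inv(cov)

def density_score(feature, mean, cov_inv):
    """Negative squared Mahalanobis distance; higher means more typical."""
    diff = feature - mean
    return -float(diff @ cov_inv @ diff)

# Placeholder data: random vectors in place of classifier features
# extracted from human conversations.
rng = np.random.default_rng(0)
human_features = rng.normal(size=(500, 8))
mean, cov_inv = fit_gaussian(human_features)

typical = np.zeros(8)        # lies near the bulk of the reference features
outlier = np.full(8, 5.0)    # lies far from the reference features

score_typical = density_score(typical, mean, cov_inv)
score_outlier = density_score(outlier, mean, cov_inv)
```

Under these assumptions, a response whose feature lies far from the human-response distribution receives a much lower score than one near its center, which is the behavior a density-based metric relies on.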