Automatically evaluating the quality of responses in open-domain dialogue systems is a challenging but crucial task. Current evaluation metrics often fail to align with human judgments, especially when assessing grammatically correct responses. To address this issue, we propose a novel metric, CausalScore, which assesses the relevance of a response by measuring the causal strength between the dialogue history and the response. The causal strength is estimated from both the unconditional dependence and the conditional dependence of the response on the dialogue history. We compare our metric with existing competitive metrics in terms of their alignment with human judgments. Our experimental results demonstrate that CausalScore significantly surpasses existing state-of-the-art metrics by aligning better with human judgments. Additionally, we collect a new dialogue dataset, CGDIALOG+, with human-annotated causal relations, together with a set of pairwise human judgments, to facilitate the development of future automatic metrics.
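To make the estimation idea concrete, below is a minimal sketch of how unconditional and conditional dependence estimates might be combined into a single causal-strength score. The classifier interfaces (`uncond_classifier`, `cond_classifier`) and the simple averaging aggregation are illustrative assumptions for exposition, not the paper's exact formulation.

```python
from typing import Callable, List

def causal_score(
    history: List[str],
    response: str,
    # p(dependent | u_i, response): unconditional dependence of the
    # response on a single history utterance u_i (hypothetical model).
    uncond_classifier: Callable[[str, str], float],
    # p(dependent | u_i, response, rest): dependence of the response on
    # u_i conditioned on the remaining history (hypothetical model).
    cond_classifier: Callable[[str, str, List[str]], float],
) -> float:
    """Illustrative causal-strength score: average the unconditional and
    conditional dependence estimates over all utterances in the history."""
    scores = []
    for i, utterance in enumerate(history):
        rest = history[:i] + history[i + 1:]
        uncond = uncond_classifier(utterance, response)
        cond = cond_classifier(utterance, response, rest)
        scores.append((uncond + cond) / 2.0)  # assumed aggregation scheme
    return sum(scores) / len(scores) if scores else 0.0
```

In practice, the two dependence estimators would be learned models (e.g., classifiers trained on annotated causal relations such as those in CGDIALOG+); the sketch only shows how their outputs could be pooled into one relevance score per history-response pair.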