Current metrics for evaluating Dialogue State Tracking (DST) systems exhibit three primary limitations: they (i) wrongly presume that slots are uniformly distributed throughout the dialogue, (ii) assign no partial credit to individual turns, and (iii) frequently overestimate or underestimate performance by counting a model's successful or failed predictions repeatedly. To address these shortcomings, we introduce a novel metric: Granular Change Accuracy (GCA). GCA evaluates the predicted changes in dialogue state over the entire dialogue history. Benchmarking reveals that GCA reduces the biases arising from the uniform-distribution assumption and from the position of errors across turns, resulting in a more precise evaluation. Notably, these biases are particularly pronounced when evaluating models trained in few-shot or zero-shot settings, and they become more evident as the model's error rate increases. GCA therefore offers significant promise, particularly for assessing models trained with limited resources. Our GCA implementation is a useful addition to the pool of DST metrics.
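To make the repeated-counting problem concrete, the sketch below contrasts joint goal accuracy (JGA), a standard turn-level DST metric that scores a turn only if the full predicted state matches the gold state, with a simplified change-level score. This is an illustration only and should not be read as the GCA definition: the function names, the toy dialogue, and the scoring rule (fraction of per-turn slot changes predicted correctly) are all assumptions introduced here.

```python
from typing import Dict, List, Optional

State = Dict[str, str]  # cumulative dialogue state: slot name -> value


def joint_goal_accuracy(gold: List[State], pred: List[State]) -> float:
    """Fraction of turns whose full predicted state equals the gold state."""
    hits = sum(g == p for g, p in zip(gold, pred))
    return hits / len(gold)


def state_changes(states: List[State]) -> List[Dict[str, Optional[str]]]:
    """Per-turn changes relative to the previous turn (None marks a removed slot)."""
    changes = []
    prev: State = {}
    for state in states:
        delta: Dict[str, Optional[str]] = {
            slot: value for slot, value in state.items() if prev.get(slot) != value
        }
        for slot in prev:
            if slot not in state:
                delta[slot] = None
        changes.append(delta)
        prev = state
    return changes


def change_accuracy(gold: List[State], pred: List[State]) -> float:
    """Hypothetical change-level score: correct changes / all gold or predicted changes."""
    correct = 0
    total = 0
    for g, p in zip(state_changes(gold), state_changes(pred)):
        slots = set(g) | set(p)
        total += len(slots)
        correct += sum(1 for s in slots if s in g and s in p and g[s] == p[s])
    return correct / total if total else 1.0


# Toy dialogue (made up for illustration): one wrong value at turn 1 is never repaired.
GOLD = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "restaurant-food": "thai"},
    {"hotel-area": "north", "hotel-stars": "4", "restaurant-food": "thai",
     "restaurant-area": "centre"},
]
PRED = [
    {"hotel-area": "south"},
    {"hotel-area": "south", "hotel-stars": "4"},
    {"hotel-area": "south", "hotel-stars": "4", "restaurant-food": "thai"},
    {"hotel-area": "south", "hotel-stars": "4", "restaurant-food": "thai",
     "restaurant-area": "centre"},
]

if __name__ == "__main__":
    print("JGA:", joint_goal_accuracy(GOLD, PRED))
    print("change-level accuracy:", change_accuracy(GOLD, PRED))
```

In this toy case the single early error makes every cumulative state wrong, so JGA is 0.0, while the change-level score is 0.75 because three of the four state changes were predicted correctly. The gap shows how a turn-level exact-match metric can penalize one mistake several times.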