Measurement of interaction quality is a critical task for the improvement of spoken dialog systems. Existing approaches to dialog quality estimation either focus on evaluating the quality of individual turns, or collect dialog-level quality measurements from end users immediately following an interaction. In contrast to these approaches, we introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA). In DQA, expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment. In this contribution, we show that: (i) while dialog quality cannot be completely decomposed into dialog-level attributes, there is a strong relationship between some objective dialog attributes and judgments of dialog quality; (ii) for the task of dialog-level quality estimation, a supervised model trained on dialog-level annotations outperforms methods based purely on aggregating turn-level features; and (iii) the proposed evaluation model generalizes across domains better than the baselines. On the basis of these results, we argue that high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.