Measuring interaction quality is critical to improving spoken dialog systems. Existing approaches to dialog quality estimation either evaluate the quality of individual turns or collect dialog-level quality ratings from end users immediately after an interaction. In contrast, we introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA). In DQA, expert annotators evaluate the quality of each dialog as a whole and also label dialogs for attributes such as goal completion and user sentiment. In this contribution, we show that: (i) while dialog quality cannot be completely decomposed into dialog-level attributes, there is a strong relationship between some objective dialog attributes and judgments of dialog quality; (ii) for the task of dialog-level quality estimation, a supervised model trained on dialog-level annotations outperforms methods based purely on aggregating turn-level features; and (iii) the proposed evaluation model generalizes across domains better than the baselines. On the basis of these results, we argue that high-quality human-annotated data is an important component of evaluating interaction quality for large, industrial-scale voice assistant platforms.