Task-oriented dialogue systems rely on predefined conversation schemes (dialogue flows) often represented as directed acyclic graphs. These flows can be manually designed or automatically generated from previously recorded conversations. Due to variations in domain expertise or reliance on different sets of prior conversations, these dialogue flows can manifest in significantly different graph structures. Despite their importance, there is no standard method for evaluating the quality of dialogue flows. We introduce FuDGE (Fuzzy Dialogue-Graph Edit Distance), a novel metric that evaluates dialogue flows by assessing their structural complexity and representational coverage of the conversation data. FuDGE measures how well individual conversations align with a flow and, consequently, how well a set of conversations is represented by the flow overall. Through extensive experiments on manually configured flows and flows generated by automated techniques, we demonstrate the effectiveness of FuDGE and its evaluation framework. By standardizing and optimizing dialogue flows, FuDGE enables conversational designers and automated techniques to achieve higher levels of efficiency and automation.
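To make the alignment idea concrete, the following is a minimal sketch of one way a conversation could be scored against a flow. It is not the FuDGE definition, which the abstract does not specify. The sketch assumes that each flow node is labelled with a short utterance string, that "fuzzy" means a substitution cost derived from string similarity (here `difflib.SequenceMatcher`), and that a conversation is scored against its best-matching root-to-leaf path. All function names and the example `flow` are hypothetical, and the structural-complexity component of the metric is not modelled.

```python
from difflib import SequenceMatcher


def sub_cost(a: str, b: str) -> float:
    """Fuzzy substitution cost in [0, 1]; 0.0 when the strings are identical."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_edit_distance(conv, path):
    """Levenshtein-style DP: unit insert/delete cost, fuzzy substitution cost."""
    m, n = len(conv), len(path)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)  # delete all conversation turns so far
    for j in range(1, n + 1):
        dp[0][j] = float(j)  # insert all path nodes so far
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,  # turn not covered by the flow
                dp[i][j - 1] + 1.0,  # flow node skipped by the conversation
                dp[i - 1][j - 1] + sub_cost(conv[i - 1], path[j - 1]),
            )
    return dp[m][n]


def root_to_leaf_paths(flow, node):
    """Yield every path of node labels from `node` down to a leaf of the DAG."""
    children = flow.get(node, [])
    if not children:
        yield [node]
        return
    for child in children:
        for rest in root_to_leaf_paths(flow, child):
            yield [node] + rest


def conversation_to_flow_distance(conv, flow, root):
    """Distance from one conversation to its best-matching path in the flow."""
    return min(fuzzy_edit_distance(conv, p) for p in root_to_leaf_paths(flow, root))


def mean_distance(convs, flow, root):
    """Average per-conversation distance over a set of conversations."""
    return sum(conversation_to_flow_distance(c, flow, root) for c in convs) / len(convs)


# Hypothetical flow: adjacency lists keyed by node label (assumption, not from the paper).
flow = {
    "greet": ["ask order id", "ask issue"],
    "ask order id": ["give status"],
    "ask issue": ["offer refund"],
}
```

Under these assumptions, a conversation that follows a path exactly scores 0, each uncovered turn adds a cost of 1, and paraphrased turns add a fractional cost. Enumerating all root-to-leaf paths is exponential in the worst case, so a practical version would run the dynamic program directly over the graph.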