Most task-oriented dialogue (TOD) benchmarks assume users who know exactly how to use the system: strict user goals constrain user behavior to the system's capabilities. We call this the "user familiarity" bias. The bias becomes more severe when combined with data-driven TOD systems, because existing static evaluations cannot measure its effect. Hence, we conduct an interactive user study to reveal how vulnerable TOD systems are in realistic scenarios. In particular, we compare users given 1) detailed goal instructions that conform to the system boundaries (closed-goal) and 2) vague goal instructions that are often unsupported but realistic (open-goal). Our study reveals that conversations in the open-goal setting lead to catastrophic failures of the system: 92% of the dialogues had significant issues. Moreover, we conduct a thorough analysis, based on error annotation, to identify the features that distinguish the two settings. From this, we discover a novel "pretending" behavior, in which the system pretends to handle user requests even though they are beyond its capabilities. We discuss its characteristics and toxicity, and show that recent large language models can also suffer from this behavior.