Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems (TDSs), but obtaining high-quality and consistent ground-truth labels from annotators is challenging. When evaluating a TDS, annotators must fully comprehend the dialogue before providing judgments. Previous studies suggest using only a portion of the dialogue context in the annotation process. However, the impact of this limitation on label quality remains unexplored. This study investigates the influence of dialogue context on annotation quality, considering truncated context for relevance and usefulness labeling. We further propose using large language models (LLMs) to summarize the dialogue context into a rich yet short description, and we study how doing so affects annotator performance. Reducing context leads to more positive ratings. Conversely, providing the entire dialogue context yields higher-quality relevance ratings but introduces ambiguity in usefulness ratings. Using the first user utterance as context leads to consistent ratings, akin to those obtained using the entire dialogue, with significantly reduced annotation effort. Our findings show how task design, particularly the availability of dialogue context, affects the quality and consistency of crowdsourced evaluation labels.