In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting such signals are usually unavailable due to the nature of the interactions, and evaluation often relies instead on crowdsourced evaluation labels. The role of user feedback in annotators' assessment of turns in a conversational interaction has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated. We explore and compare two methodologies for assessing TDSs: one that includes the user's follow-up utterance and one that does not. We use both crowdworkers and large language models (LLMs) as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation quality. Our findings indicate a distinct difference in the ratings assigned by the two annotator groups across the two setups, indicating that user feedback does influence system evaluation. Crowdworkers are more susceptible to user feedback on usefulness and interestingness, whereas LLMs are more susceptible on interestingness and relevance. User feedback leads to a more personalized assessment of usefulness by workers, aligning closely with the user's explicit feedback. Additionally, in cases of ambiguous or complex user requests, user feedback improves agreement among crowdworkers. These findings emphasize the significance of user feedback in refining system evaluations and suggest the potential for automated feedback integration in future research. We publicly release the annotated data to foster research in this area.