Traditionally, offline datasets have been used to evaluate task-oriented dialogue (TOD) models. These datasets lack context awareness, making them suboptimal benchmarks for conversational systems. In contrast, user-agents, which are context-aware, can simulate the variability and unpredictability of human conversations, making them better suited as evaluators. Building on prior research that utilized large language models (LLMs) to develop user-agents, our work uses LLMs to create user-agents for the evaluation of TOD systems. This involves prompting an LLM, using in-context examples as guidance, and tracking the user-goal state. Our evaluation of diversity and task-completion metrics for the user-agents shows that better prompts yield improved performance. Additionally, we propose methodologies for the automatic evaluation of TOD models within this dynamic framework.
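To make the described setup concrete, the sketch below illustrates one plausible shape of such an LLM-based user-agent: the prompt carries in-context examples, the user goal is held as a slot-value dictionary, and slots are marked fulfilled as the dialogue progresses. This is a minimal illustration under stated assumptions, not the paper's implementation; `llm_complete`, `tod_system`, and the slot-matching heuristic are hypothetical stand-ins.

```python
# Minimal sketch (assumed, not the paper's exact method) of an LLM-based
# user-agent evaluating a TOD system. The goal state is a slot->value dict;
# a dialogue counts as completed when every slot has been fulfilled.
from typing import Callable, Dict, List

# Hypothetical in-context examples guiding the user-agent's style.
IN_CONTEXT_EXAMPLES = """\
Example dialogue:
System: How can I help you?
User: I need a cheap Italian restaurant in the centre.
"""

def llm_complete(prompt: str) -> str:
    """Hypothetical wrapper around any LLM completion API."""
    raise NotImplementedError("plug in your LLM provider here")

def build_prompt(goal: Dict[str, str], history: List[str]) -> str:
    # Unfulfilled slots steer the user-agent toward what it still must ask for.
    pending = ", ".join(f"{slot}={value}" for slot, value in goal.items())
    return (
        "You are a user talking to a booking assistant.\n"
        f"{IN_CONTEXT_EXAMPLES}\n"
        f"Your remaining goal: {pending or 'none - end the dialogue'}\n"
        + "\n".join(history)
        + "\nUser:"
    )

def update_goal_state(goal: Dict[str, str], system_turn: str) -> None:
    # Naive goal-state tracking: a slot is fulfilled once the system echoes
    # its value back. Real trackers would be considerably more robust.
    for slot in list(goal):
        if goal[slot].lower() in system_turn.lower():
            del goal[slot]

def run_dialogue(
    goal: Dict[str, str],
    tod_system: Callable[[List[str]], str],
    max_turns: int = 10,
) -> bool:
    """Simulate one conversation; return True if the goal was completed."""
    history: List[str] = []
    for _ in range(max_turns):
        user_turn = llm_complete(build_prompt(goal, history))
        history.append(f"User: {user_turn}")
        system_turn = tod_system(history)
        history.append(f"System: {system_turn}")
        update_goal_state(goal, system_turn)
        if not goal:  # all slots fulfilled -> task completed
            return True
    return False
```

Running `run_dialogue` over a corpus of sampled goals gives a task-completion rate, and the generated user turns can be pooled to compute diversity metrics, matching the two evaluation axes mentioned above.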