Task-based chatbots are software systems, typically embedded in real-world applications, that assist users in completing tasks through a conversational interface. As chatbots gain popularity, effectively assessing their quality has become crucial. Since traditional testing techniques fail to systematically exercise the conversational space of chatbots, several approaches specifically targeting chatbots have emerged from both industry and research. Although these techniques have advanced over the years, they still exhibit limitations, such as the simplicity of the generated test scenarios and the weakness of the implemented oracles. In this paper, we conduct a confirmatory study that investigates such limitations by evaluating the effectiveness of state-of-the-art chatbot testing techniques on a curated selection of task-based chatbots from GitHub, developed using the most popular commercial and open-source platforms.