With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods for identifying unexpected behaviour in single-turn scenarios. However, multi-turn dialogue testing remains underexplored, and the test oracle problem in multi-turn settings poses a persistent challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a MetamORphic multi-TuRn diAlogue testing appRoach that mitigates the test oracle problem in the assessment of LLM-based dialogue systems. MORTAR automates the generation of follow-up question-answer (QA) dialogue test cases via multiple dialogue-level perturbations and metamorphic relations. It employs a novel knowledge-graph-based dialogue information model that efficiently generates perturbed dialogue test datasets and detects bugs in multi-turn dialogue systems at low cost. Because the approach does not require an LLM as a judge, it eliminates potential judgment biases from the evaluation step. Experiments on multiple LLM-based dialogue systems, together with comparisons against single-turn metamorphic testing approaches, show that MORTAR uncovers more unique bugs; in particular, it detects up to four times more unique severe bugs than the most effective existing metamorphic testing approach.
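The oracle-free idea sketched above can be illustrated with a minimal example: apply a dialogue-level perturbation (here, swapping two QA turns assumed to be independent) and check a metamorphic relation, namely that answers to the untouched turns do not change. All names below (`swap_independent_turns`, `check_metamorphic_relation`, the dialogue representation as a list of question strings) are hypothetical illustrations, not MORTAR's actual API or perturbation set.

```python
# Hedged sketch: a dialogue-level metamorphic test without an LLM judge.
# A dialogue is modeled as a list of question strings; answer_fn maps a
# prefix of questions (the conversation so far) to the system's answer.

def swap_independent_turns(dialogue, i, j):
    """Dialogue-level perturbation: swap two turns assumed independent."""
    perturbed = list(dialogue)
    perturbed[i], perturbed[j] = perturbed[j], perturbed[i]
    return perturbed

def check_metamorphic_relation(answer_fn, dialogue, i, j):
    """MR: answers to turns outside the swapped pair should not change.

    Returns the indices of turns whose answers diverge after the
    perturbation; a non-empty result flags a candidate bug without
    needing a ground-truth answer (no test oracle required).
    """
    original = [answer_fn(dialogue[:k + 1]) for k in range(len(dialogue))]
    perturbed_dialogue = swap_independent_turns(dialogue, i, j)
    perturbed = [answer_fn(perturbed_dialogue[:k + 1])
                 for k in range(len(perturbed_dialogue))]
    # Compare only the turns untouched by the swap.
    return [k for k in range(len(dialogue))
            if k not in (i, j) and original[k] != perturbed[k]]
```

For example, a system whose answer depends only on the current question satisfies the relation, while one whose answer is sensitive to the order of earlier (independent) turns is reported as buggy on the untouched turn.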