Identifying specific and often complex behaviors from large language models (LLMs) in conversational settings is crucial for their evaluation. Recent work proposes novel techniques to find natural language prompts that induce specific behaviors from a target model, yet they are mainly studied in single-turn settings. In this work, we study behavior elicitation in the context of multi-turn conversations. We first offer an analytical framework that categorizes existing methods into three families based on their interactions with the target model: those that use only prior knowledge, those that use offline interactions, and those that learn from online interactions. We then introduce a generalized multi-turn formulation of the online method, unifying single-turn and multi-turn elicitation. We evaluate all three families of methods on automatically generating multi-turn test cases. We investigate the efficiency of these approaches by analyzing the trade-off between the query budget, i.e., the number of interactions with the target model, and the success rate, i.e., the discovery rate of behavior-eliciting inputs. We find that online methods can achieve an average success rate of 45/19/77% with just a few thousand queries over three tasks where static methods from existing multi-turn conversation benchmarks find few or even no failure cases. Our work highlights a novel application of behavior elicitation methods in multi-turn conversation evaluation and the need for the community to move towards dynamic benchmarks.
翻译:从大型语言模型(LLMs)在对话场景中识别特定且通常复杂的行为,对于其评估至关重要。近期研究提出了新颖技术来寻找能够从目标模型中诱导出特定行为的自然语言提示,但这些方法主要局限于单轮交互场景。本文研究了多轮对话背景下的行为引发问题。我们首先提出了一个分析框架,将现有方法根据其与目标模型的交互方式分为三类:仅利用先验知识的方法、利用离线交互的方法,以及从在线交互中学习的方法。随后,我们引入了在线方法的广义多轮形式化表述,统一了单轮与多轮行为引发框架。我们在自动生成多轮测试用例的任务上评估了这三类方法。通过分析查询预算(即与目标模型的交互次数)与成功率(即引发行为的输入发现率)之间的权衡关系,我们深入探究了这些方法的效率。研究发现,在三个任务中,在线方法仅需数千次查询即可实现平均45%/19%/77%的成功率,而现有多轮对话基准中的静态方法仅发现极少甚至无法找到失效案例。本研究凸显了行为引发方法在多轮对话评估中的创新应用,并表明研究社区亟需向动态基准测试方向推进。