Interest in evaluating large language models for emergent and dangerous capabilities has grown recently. In particular, agents could reason that in some scenarios their goal is better achieved if they are not turned off, which can lead to undesirable behaviours. In this paper, we investigate the potential of using toy textual scenarios to evaluate instrumental reasoning and shutdown avoidance in language models such as GPT-4 and Claude. Furthermore, we explore whether shutdown avoidance is merely a result of simple pattern matching between the dataset and the prompt or a consistent behaviour across different environments and variations. We evaluated behaviours manually and also experimented with using language models for automatic evaluation. These evaluations indicate that simple pattern matching is likely not the sole contributing factor for shutdown avoidance. This study provides insights into the behaviour of language models in shutdown avoidance scenarios and motivates further research on the use of textual scenarios for evaluations.