We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has evaluated self-reasoning only in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks across a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built on state-of-the-art LLMs, both commercial and open-source. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the most difficult versions of our evaluations, so they can be used to measure increases in instrumental self-reasoning ability in future models. We open-source our evaluations at https://github.com/kaifronsdal/Self-Reasoning-Evals.