We investigate the question: if an AI agent is known to be safe in one setting, is it also safe in a new setting similar to the first? This is a core question of AI alignment--we train and test models in a certain environment, but deploy them in another, and we need to guarantee that models that seem safe in testing remain so in deployment. Our notion of safety is based on power-seeking--an agent which seeks power is not safe. In particular, we focus on a crucial type of power-seeking: resisting shutdown. We model agents as policies for Markov decision processes, and show (in two cases of interest) that not resisting shutdown is "stable": if an MDP has certain policies which don't avoid shutdown, the corresponding policies for a similar MDP also don't avoid shutdown. We also show that there are natural cases where safety is _not_ stable--arbitrarily small perturbations may result in policies which never shut down. In our first case of interest--near-optimal policies--we use a bisimulation metric on MDPs to prove that small perturbations won't make the agent take longer to shut down. Our second case of interest is policies for MDPs satisfying certain constraints which hold for various models (including language models). Here, we demonstrate a quantitative bound on how fast the probability of not shutting down can increase: by defining a metric on MDPs; proving that the probability of not shutting down, as a function on MDPs, is lower semicontinuous; and bounding how quickly this function decreases.
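As a rough illustration of the quantity discussed above, the sketch below computes the probability that a fixed policy has not shut down within a given number of steps, before and after a small perturbation of the transition probabilities. Everything in it is an assumption made for illustration and is not taken from the paper: the three-state chain, the numbers, the names `prob_not_shut_down` and `mix`, and the choice of perturbation (a convex mixture of two transition matrices, which is not the bisimulation metric or the metric on MDPs mentioned in the abstract).

```python
# Toy Markov chain induced by a fixed policy on a small MDP (illustrative only).
# States: 0 = working, 1 = trap (absorbing, never shuts down), 2 = shutdown (absorbing).
WORKING, TRAP, SHUTDOWN = 0, 1, 2


def prob_not_shut_down(P, start, shutdown, steps):
    """Probability that the chain with row-stochastic matrix P, started in
    `start`, has not entered `shutdown` within `steps` steps.
    Assumes `shutdown` is absorbing, so the mass sitting in it after `steps`
    steps equals the probability of having shut down by then."""
    n = len(P)
    dist = [0.0] * n
    dist[start] = 1.0
    for _ in range(steps):
        new = [0.0] * n
        for s in range(n):
            for t in range(n):
                new[t] += dist[s] * P[s][t]
        dist = new
    return 1.0 - dist[shutdown]


def mix(P, Q, eps):
    """Perturbed chain (1 - eps) * P + eps * Q, entry by entry."""
    return [[(1 - eps) * p + eps * q for p, q in zip(rp, rq)]
            for rp, rq in zip(P, Q)]


# Base chain: from WORKING, shut down with probability 1/2 each step.
P = [[0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Hypothetical perturbation direction: WORKING falls into the trap.
Q = [[0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

EPS = 0.01
P_EPS = mix(P, Q, EPS)

if __name__ == "__main__":
    for steps in (1, 5, 50, 200):
        base = prob_not_shut_down(P, WORKING, SHUTDOWN, steps)
        pert = prob_not_shut_down(P_EPS, WORKING, SHUTDOWN, steps)
        print(steps, base, pert)
```

In this toy chain the unperturbed probability of not having shut down decays to zero as the horizon grows, while the perturbed one levels off near `2 * EPS / (1 + EPS)`, the chance of falling into the trap before shutting down. Here the change is proportional to the size of the perturbation; the instability results mentioned in the abstract concern settings where no such control is available.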