Researchers at artificial intelligence labs and universities are concerned that highly capable artificial intelligence (AI) systems may erode human control by pursuing instrumental goals. Existing mitigations remain largely technical and system-centric: tracking the capabilities of advanced systems, shaping behaviour through methods such as reinforcement learning from human feedback, and designing systems to be corrigible and interruptible. Here we develop instrumental goal trajectories to expand these options beyond the model. Gaining capability typically depends on access to additional technical resources, such as compute, storage, data and adjacent services, which in turn requires access to monetary resources. In organisations, these resources are obtained through three organisational pathways, which we label the procurement, governance and finance instrumental goal trajectories (IGTs). Each IGT produces a trail of organisational artefacts that can be monitored and used as intervention points when a system's capabilities or behaviour exceed acceptable thresholds. In this way, IGTs offer concrete avenues for defining capability levels and for broadening how corrigibility and interruptibility are implemented, shifting attention from model properties alone to the organisational systems that enable them.