In this report, we explore the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. We refer to this cluster of capabilities as "autonomous replication and adaptation" or ARA. We believe that systems capable of ARA could have wide-reaching and hard-to-anticipate consequences, and that measuring and forecasting ARA may be useful for informing measures around security, monitoring, and alignment. Additionally, once a system is capable of ARA, placing bounds on its capabilities may become significantly more difficult. We construct four simple example agents that combine language models with tools that allow them to take actions in the world. We then evaluate these agents on 12 tasks relevant to ARA. We find that these language model agents can only complete the easiest tasks from this list, although they make some progress on the more challenging tasks. Unfortunately, these evaluations are not adequate to rule out the possibility that near-future agents will be capable of ARA. In particular, we do not think that these evaluations provide good assurance that the "next generation" of language models (e.g., a 100x effective compute scale-up over existing models) will not yield agents capable of ARA, unless intermediate evaluations are performed during pretraining. Relatedly, we expect that fine-tuning of the existing models could produce substantially more competent agents, even if the fine-tuning is not directly targeted at ARA.