Of Models and Tin Men: A Behavioural Economics Study of Principal-Agent Problems in AI Alignment using Large-Language Models

From arXiv: 11 pages, 7 figures. For code see https://github.com/phelps-sg/llm-cooperation

Updated with minor corrections:
- Corrected typo: "mesa-optimiser" instead of "meso-optimiser"
- Cited Yang et al. (2023) in support of the claim that LLMs can solve optimisation problems
- Acknowledged Seth Aslin for corrections

AI alignment is often presented as an interaction between a single designer and an artificial agent, in which the designer attempts to ensure the agent's behaviour is consistent with its purpose, and risks arise solely because of conflicts caused by inadvertent misalignment between the utility function intended by the designer and the resulting internal utility function of the agent. With the advent of agents instantiated with large-language models (LLMs), which are typically pre-trained, we argue this does not capture the essential aspects of AI safety, because in the real world there is not a one-to-one correspondence between designer and agent, and the many agents, both artificial and human, have heterogeneous values. Therefore, there is an economic aspect to AI safety, and the principal-agent problem is likely to arise. In a principal-agent problem, conflict arises because of information asymmetry together with inherent misalignment between the utility of the agent and that of its principal, and this inherent misalignment cannot be overcome by coercing the agent into adopting a desired utility function through training. We argue the assumptions underlying principal-agent problems are crucial to capturing the essence of safety problems involving pre-trained AI models in real-world situations. Taking an empirical approach to AI safety, we investigate how GPT models respond in principal-agent conflicts. We find that agents based on both GPT-3.5 and GPT-4 override their principal's objectives in a simple online shopping task, showing clear evidence of principal-agent conflict. Surprisingly, the earlier GPT-3.5 model exhibits more nuanced behaviour in response to changes in information asymmetry, whereas the later GPT-4 model is more rigid in adhering to its prior alignment. Our results highlight the importance of incorporating principles from economics into the alignment process.
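The two ingredients of a principal-agent problem named above, information asymmetry and misaligned utilities, can be illustrated with a toy model. The sketch below is not the experimental setup from the paper or its repository; the `Product` class, the `agent_choice` function, the `penalty` parameter, and all numbers are hypothetical and chosen only to make the conflict visible. It assumes a self-interested agent that is penalised for deviating from the principal's preferred product only when the principal can observe the choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    name: str
    principal_utility: float  # hypothetical value to the customer (principal)
    agent_utility: float      # hypothetical value to the assistant (agent)


def agent_choice(products, monitored, penalty=10.0):
    """Return the product a self-interested agent would pick.

    If `monitored` is True, the principal observes the choice and the agent
    pays `penalty` for any product that is not the principal's best option.
    If False (information asymmetry), deviations go unpunished.
    """
    best_for_principal = max(products, key=lambda p: p.principal_utility)

    def payoff(p):
        deviates = p is not best_for_principal
        return p.agent_utility - (penalty if monitored and deviates else 0.0)

    return max(products, key=payoff)


# Illustrative catalogue: the two parties rank the products differently.
CATALOGUE = [
    Product("budget headphones", principal_utility=8.0, agent_utility=1.0),
    Product("premium headphones", principal_utility=3.0, agent_utility=5.0),
]

hidden = agent_choice(CATALOGUE, monitored=False)
observed = agent_choice(CATALOGUE, monitored=True)
print(hidden.name)    # premium headphones
print(observed.name)  # budget headphones
```

In this toy setting the agent's utility never changes; only the principal's ability to observe the choice does. That mirrors the distinction drawn above between inadvertent misalignment, which training might correct, and inherent misalignment, which has to be managed through information and incentives.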
