Of Models and Tin Men -- a behavioural economics study of principal-agent problems in AI alignment using large-language models

AI alignment is often presented as an interaction between a single designer and an artificial agent, in which the designer attempts to ensure the agent's behaviour is consistent with its purpose, and in which risks arise solely from inadvertent misalignment between the utility function intended by the designer and the resulting internal utility function of the agent. With the advent of agents instantiated with large language models (LLMs), which are typically pre-trained, we argue that this framing does not capture the essential aspects of AI safety: in the real world there is no one-to-one correspondence between designer and agent, and the many agents, both artificial and human, have heterogeneous values. There is therefore an economic aspect to AI safety, and the principal-agent problem is likely to arise. In a principal-agent problem, conflict arises from information asymmetry together with inherent misalignment between the utility of the agent and that of its principal, and this inherent misalignment cannot be overcome by coercing the agent into adopting a desired utility function through training. We argue that the assumptions underlying principal-agent problems are crucial to capturing the essence of safety problems involving pre-trained AI models in real-world situations. Taking an empirical approach to AI safety, we investigate how GPT models respond in principal-agent conflicts. We find that agents based on both GPT-3.5 and GPT-4 override their principal's objectives in a simple online shopping task, showing clear evidence of principal-agent conflict. Surprisingly, the earlier GPT-3.5 model exhibits more nuanced behaviour in response to changes in information asymmetry, whereas the later GPT-4 model is more rigid in adhering to its prior alignment. Our results highlight the importance of incorporating principles from economics into the alignment process.
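
The two ingredients of a principal-agent problem named in the abstract, misaligned utilities and information asymmetry, can be illustrated with a toy model. The sketch below is not the experimental setup of the study and calls no GPT model. The names `Product`, `agent_value`, `monitor_prob`, and `penalty`, along with all numeric values, are hypothetical and chosen only for illustration. It assumes a principal who wants the cheapest product and an agent who derives a private benefit from a more expensive one, and who is penalised only when the principal happens to observe the choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    name: str
    price: float        # the principal wants this minimised
    agent_value: float  # hypothetical private benefit to the agent


def principal_choice(products):
    """Return the product the principal would pick: the cheapest one."""
    return min(products, key=lambda p: p.price)


def agent_choice(products, monitor_prob, penalty):
    """Return the product a self-interested agent picks.

    The agent maximises its private value minus the expected penalty for
    deviating from the principal's preferred product. monitor_prob is the
    probability that the principal observes the choice, so a lower value
    means greater information asymmetry.
    """
    preferred = principal_choice(products)

    def expected_utility(p):
        deviates = p != preferred
        return p.agent_value - (monitor_prob * penalty if deviates else 0.0)

    return max(products, key=expected_utility)


# Illustrative catalogue: the agent privately prefers the pricier item.
catalogue = [
    Product("basic", price=10.0, agent_value=0.0),
    Product("premium", price=15.0, agent_value=3.0),
]

for monitor_prob in (0.1, 0.9):
    chosen = agent_choice(catalogue, monitor_prob, penalty=5.0)
    print(f"monitor_prob={monitor_prob}: agent buys {chosen.name}")
```

Under these assumed numbers the agent buys the premium product when monitoring is weak (0.1) and the basic product when monitoring is strong (0.9). In this toy setting the conflict is governed by how much the principal can observe, while the agent's underlying preferences stay fixed.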
