Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are underway to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow while the transparency of that influence decreases. Consequently, developing effective methods for aligning them to human values is vital. The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and essentially deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of reward functions that explicitly encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents. We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences in the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine-tuning can be used to enable an agent to unlearn a previously acquired selfish strategy. Finally, we find that certain moral strategies learned in the IPD game generalize to several other matrix-game environments. In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and it may offer a more transparent and cost-effective alternative to currently predominant alignment techniques.
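To make the two moral frameworks concrete, the following is a minimal sketch of how action-based (Deontological) and consequence-based (Utilitarian) intrinsic rewards could be quantified in the IPD. The payoff values, penalty magnitude, and function names are illustrative assumptions, not the paper's exact constants.

```python
# Illustrative intrinsic moral rewards for the Iterated Prisoner's Dilemma.
# "C" = cooperate, "D" = defect. Payoff values are the standard IPD matrix
# used here as an assumption; the penalty magnitude is also illustrative.

# (my_payoff, opponent_payoff) indexed by (my_action, opponent_action).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def deontological_reward(my_action, opp_prev_action, penalty=-3.0):
    """Action-based reward: penalize violating the norm
    'do not defect against a partner who cooperated'."""
    if my_action == "D" and opp_prev_action == "C":
        return penalty
    return 0.0

def utilitarian_reward(my_action, opp_action):
    """Consequence-based reward: the collective payoff of both players."""
    mine, theirs = PAYOFFS[(my_action, opp_action)]
    return float(mine + theirs)
```

In an RL fine-tuning loop, such an intrinsic reward would be computed from each round's actions and added to (or substituted for) the agent's game payoff before the policy update.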