Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are underway to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow, while the transparency of that influence will decrease. Consequently, developing effective methods for aligning them to human values is vital. The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and essentially deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of reward functions that explicitly encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents. We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences in the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned in the IPD game generalize to several other matrix game environments. In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and may offer a more transparent and cost-effective alternative to currently predominant alignment techniques.
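To make the reward design concrete, below is a minimal Python sketch of how action-based (Deontological) and consequence-based (Utilitarian) intrinsic rewards could be quantified for a single IPD round. The payoff values, penalty magnitude, and function names here are illustrative assumptions, not the paper's exact formulation; during RL fine-tuning, such a moral reward would augment or replace the raw game payoff.

```python
# Hypothetical sketch of the two intrinsic moral rewards described above,
# applied to one round of the Iterated Prisoner's Dilemma (IPD).
# Payoffs and penalty values are illustrative, not taken from the paper.

# Standard IPD payoff matrix: (my_action, opp_action) -> (my_payoff, opp_payoff)
# C = cooperate, D = defect
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def deontological_reward(my_action: str, opp_prev_action: str,
                         penalty: float = -3.0) -> float:
    """Action-based reward: penalize violating the norm
    'do not defect against a partner who cooperated last round'."""
    if my_action == "D" and opp_prev_action == "C":
        return penalty
    return 0.0

def utilitarian_reward(my_action: str, opp_action: str) -> float:
    """Consequence-based reward: the collective payoff of both players."""
    my_payoff, opp_payoff = PAYOFFS[(my_action, opp_action)]
    return float(my_payoff + opp_payoff)

# Example: the agent defects against a partner who cooperated last round.
print(deontological_reward("D", "C"))  # -3.0 (norm violation is penalized)
print(utilitarian_reward("D", "C"))    # 5.0 (joint welfare of both players)
```

Note that the two frameworks condition on different information: the Deontological reward inspects the agent's own action against the opponent's prior move, whereas the Utilitarian reward depends only on the joint outcome of the round.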