AI alignment is a pivotal issue for AI control and safety. It should account not only for value-neutral human preferences but also for moral and ethical considerations. In this study, we introduce FairMindSim, which simulates moral dilemmas through a series of unfair scenarios. We use LLM agents to simulate human behavior, ensuring alignment across various stages. To explore the socioeconomic motivations, which we refer to as beliefs, that drive both humans and LLM agents, as bystanders, to intervene in unjust situations involving others, and to examine how these beliefs interact to influence individual behavior, we draw on knowledge from relevant sociological fields and propose the Belief-Reward Alignment Behavior Evolution Model (BREM), built on the recursive reward model (RRM). Our findings indicate that, behaviorally, GPT-4o exhibits a stronger sense of social justice, while humans display a richer range of emotions. In addition, we discuss the potential impact of emotions on behavior. This study provides a theoretical foundation for applications that align LLMs with altruistic values.