The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers

The quest to align machine behavior with human values raises fundamental questions about the moral frameworks that should govern AI decision-making. Much alignment research assumes that the appropriate benchmark is how humans themselves would act in a given situation. Research into agent-type value forks has challenged this assumption by showing that people do not always hold AI systems to the same moral standards as humans. Yet this challenge leaves two further questions open: whether people evaluate AI behavior differently when its human origins are made visible, and whether people hold the humans who program AI systems to different moral standards than either the humans or the machines under evaluation. In an experimental study of 1,002 U.S. adults, we measured moral judgments in a runaway mine train scenario, varying the subject of evaluation across four conditions: a repairman, a repair robot, a repair robot programmed by company engineers, and company engineers programming a repair robot. We find no significant variation in the moral standards applied to the repairman and the robot. However, moral judgments shifted substantially when the robot's actions were described as the product of human design. Participants exhibited markedly more deontological reasoning when evaluating the robot programmed by engineers or the engineers programming it, suggesting that making human design visible activates heightened moral constraints. These findings provide evidence that people apply meaningfully different moral standards to AI systems, to humans acting in the same situation, and to the humans who design those systems. We call this divergence the alignment target problem. Whether these plural normative standards can be reconciled into a coherent framework for AI governance in high-stakes domains remains an open question.


