The quest to align machine behavior with human values raises fundamental questions about the moral frameworks that should govern AI decision-making. Much alignment research assumes that the appropriate benchmark is how humans themselves would act in a given situation. Research into agent-type value forks has challenged this assumption by showing that people do not always hold AI systems to the same moral standards as humans. Yet this challenge leaves two questions open: whether people evaluate AI behavior differently when its human origins are made visible, and whether people hold the humans who program AI systems to different moral standards than either the humans or the machines under evaluation. In an experimental study of 1,002 U.S. adults, we measured moral judgments in a runaway mine train scenario, varying the subject of evaluation across four conditions: a repairman, a repair robot, a repair robot programmed by company engineers, and company engineers programming a repair robot. We find no significant difference in the moral standards applied to the repairman and the robot. However, moral judgments shifted substantially when the robot's actions were described as the product of human design. Participants exhibited markedly more deontological reasoning when evaluating the robot programmed by engineers or the engineers programming it, suggesting that making human design visible activates heightened moral constraints. These findings provide evidence that people apply meaningfully different moral standards to AI systems, to humans acting in the same situation, and to the humans who design those systems. We call this divergence the alignment target problem. Whether these plural normative standards can be reconciled into a coherent framework for AI governance in high-stakes domains remains an open question.