The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers

The project of aligning machine behavior with human values raises a basic problem: whose moral expectations should guide AI decision-making? Much alignment research assumes that the appropriate benchmark is how humans themselves would act in a given situation. Studies of agent-type value forks challenge this assumption by showing that people do not always judge humans and AI systems identically. This paper extends that challenge by examining two further possibilities: first, that evaluations of AI behavior change when its human origins are made visible; and second, that people judge the humans who program AI systems differently from either the machines or the human actors they are compared against. An experiment with 1,002 U.S. adults measured moral judgments in a runaway mine train scenario, varying the subject of evaluation across four conditions: a repairman, a repair robot, a repair robot programmed by company engineers, and company engineers programming a repair robot. We find no significant difference in evaluations of the repairman and the robot. However, judgments shifted substantially when the robot's actions were described as the product of human design. Participants exhibited markedly more deontological, rule-based reasoning when evaluating either the programmed robot or the engineers who programmed it, suggesting that rendering human agency visible activates heightened moral constraints. These findings indicate that people may evaluate humans, AI systems acting in the same situation, and the humans who design them in meaningfully different ways. The fact that these evaluations do not necessarily converge gives rise to the alignment target problem: which normative target should guide the development of artificial moral agents in high-stakes domains, and whether these plural judgments can be reconciled within a coherent account of value alignment.
