Aligning AI systems with human values remains a fundamental challenge, but does our inability to create perfectly aligned models preclude obtaining the benefits of alignment? We study a strategic setting in which a human user interacts with multiple differently misaligned AI agents, none of which is individually well-aligned. Our key insight is that when the user's utility lies approximately within the convex hull of the agents' utilities, a condition that becomes easier to satisfy as model diversity increases, strategic competition can yield outcomes comparable to interacting with a perfectly aligned model. We model this as a multi-leader Stackelberg game, extending Bayesian persuasion to multi-round conversations between differently informed parties, and prove three results: (1) when perfect alignment would allow the user to learn her Bayes-optimal action, she can also do so in all equilibria under the convex hull condition; (2) under weaker assumptions requiring only approximate utility learning, a non-strategic user employing quantal response achieves near-optimal utility in all equilibria; and (3) when the user selects the best single AI after an evaluation period, equilibrium guarantees remain near-optimal without further distributional assumptions. We complement the theory with two sets of experiments.
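The convex hull condition is a concrete membership test: the user's utility vector should be expressible as a convex combination of the agents' utility vectors. A minimal sketch of how such a check could be done numerically, as a feasibility linear program (the function name and example vectors are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, vertices):
    """Check whether `point` lies in the convex hull of the rows of
    `vertices`, by solving the feasibility LP:
        find w >= 0 with sum(w) = 1 and vertices.T @ w = point.
    """
    vertices = np.asarray(vertices, dtype=float)
    point = np.asarray(point, dtype=float)
    k = vertices.shape[0]
    # Equality constraints: convex weights must reproduce the point
    # and sum to one.
    A_eq = np.vstack([vertices.T, np.ones(k)])
    b_eq = np.append(point, 1.0)
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * k, method="highs")
    return res.status == 0  # 0 = feasible, 2 = infeasible

# Hypothetical example: three agents' utility vectors over two outcomes.
agent_utilities = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
print(in_convex_hull([0.2, 0.2], agent_utilities))  # → True
print(in_convex_hull([2.0, 2.0], agent_utilities))  # → False
```

As diversity grows (more rows in `agent_utilities`), the hull expands, making the membership condition easier to satisfy, which is the intuition behind the paper's diversity argument. An approximate version would replace the equality constraint with a small slack tolerance.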