Existing work on the alignment problem has focused mainly on (1) qualitative descriptions of the alignment problem; (2) attempts to align AI actions with human interests through value specification and learning; and/or (3) analyses of a single agent or of humanity as a monolith. Recent sociotechnical approaches highlight the need to understand complex misalignment among multiple human and AI agents. We address this gap by adapting a computational social science model of human contention to the alignment problem. Our model quantifies misalignment in large, diverse agent groups with potentially conflicting goals across various problem areas. Misalignment scores in our framework depend on the observed agent population, the domain in question, and conflict between agents' weighted preferences. Through simulations, we demonstrate how our model captures intuitive aspects of misalignment across different scenarios. We then apply our model to two case studies, including an autonomous vehicle setting, to showcase its practical utility. Our approach offers enhanced explanatory power for complex sociotechnical environments and could inform the design of more aligned AI systems in real-world applications.