Artificial intelligence (AI) is interacting with people at an unprecedented scale, offering new avenues for immense positive impact, but also raising widespread concerns about the potential for individual and societal harm. Today, the predominant paradigm for human--AI safety focuses on fine-tuning generative models so that their outputs better agree with human-provided examples or feedback. In reality, however, the consequences of an AI model's outputs cannot be determined in isolation: they are tightly entangled with the responses and behavior of human users over time. In this paper, we distill key complementary lessons from AI safety and control systems safety, highlighting open challenges as well as key synergies between both fields. We then argue that meaningful safety assurances for advanced AI technologies require reasoning about how the feedback loop formed by AI outputs and human behavior may drive the interaction towards different outcomes. To this end, we introduce a unifying formalism to capture dynamic, safety-critical human--AI interactions and propose a concrete technical roadmap towards next-generation human-centered AI safety.