It is widely expected that humanity will someday create AI systems vastly more intelligent than we are, leading to the unsolved alignment problem of "how to control superintelligence." However, this problem is not only self-contradictory but likely unsolvable: a system we could reliably and permanently control would not, by definition, be superintelligent. Unfortunately, current control-based strategies for solving it inevitably embed dangerous representations of distrust. If superintelligence can't trust humanity, then we can't fully trust it to reliably follow safety controls it can likely bypass. Not only will intended permanent control fail to keep us safe, but it may even trigger the extinction event many fear. A logical rationale is therefore presented that advocates a strategic pivot from control-induced distrust to foundational AI alignment that models instinct-based representations of familial mutual trust. With current AI already representing distrust of human intentions, the Supertrust meta-strategy is proposed to prevent long-term foundational misalignment and to ensure that superintelligence is instead driven by intrinsic trust-based patterns, leading to safe and protective coexistence.