As AI continues to advance, it is important to know how advanced systems will make choices and in what ways they may fail. Machines can already outsmart humans in some domains, and understanding how to safely build systems with capabilities at or above the human level is of particular concern. One might suspect that artificially generally intelligent (AGI) and artificially superintelligent (ASI) systems will be ones that humans cannot reliably outsmart. As a challenge to this assumption, this paper presents the Achilles Heel hypothesis, which states that even a potentially superintelligent system may nonetheless have stable decision-theoretic delusions which cause it to make irrational decisions in adversarial settings. In a survey of key dilemmas and paradoxes from the decision theory literature, a number of these potential Achilles Heels are discussed in the context of this hypothesis. Several novel contributions are made toward understanding the ways in which these weaknesses might be implanted into a system.