As the capabilities of large machine learning models continue to grow, and as the autonomy afforded to such models continues to expand, the spectre of a new adversary looms: the models themselves. The threat that a model might behave in a seemingly reasonable manner while secretly and subtly modifying its behavior for ulterior reasons is often referred to as deceptive alignment in the AI Safety & Alignment communities. Consequently, we call the direction of monitoring for this threat Deceptive Alignment Monitoring. In this work, we identify emerging directions in diverse machine learning subfields that we believe will become increasingly important and intertwined in the near future for deceptive alignment monitoring, and we argue that advances in these fields present both long-term challenges and new research opportunities. We conclude by advocating for greater involvement by the adversarial machine learning community in these emerging directions.