As the capabilities of large machine learning models continue to grow, and as the autonomy afforded to such models continues to expand, the spectre of a new adversary looms: the models themselves. The threat that a model might behave in a seemingly reasonable manner, while secretly and subtly modifying its behavior for ulterior reasons is often referred to as deceptive alignment in the AI Safety & Alignment communities. Consequently, we call this new direction Deceptive Alignment Monitoring. In this work, we identify emerging directions in diverse machine learning subfields that we believe will become increasingly important and intertwined in the near future for deceptive alignment monitoring, and we argue that advances in these fields present both long-term challenges and new research opportunities. We conclude by advocating for greater involvement by the adversarial machine learning community in these emerging directions.
翻译:随着大规模机器学习模型能力的持续增强,以及赋予这类模型的自主性不断扩大,一种新的威胁悄然浮现:模型本身。模型可能表现出看似合理的行为,同时却暗中且微妙地出于隐秘动机改变其行为——这种威胁在AI安全与对齐社区中常被称为“欺骗性对齐”。因此,我们将这一新方向命名为“欺骗性对齐监控”。本文识别了机器学习各子领域中新兴的研究方向,我们认为这些方向在不久的将来对于欺骗性对齐监控将变得日益重要且相互交织;同时,我们论证这些领域的进展既带来了长期挑战,也开辟了新的研究机遇。最后,我们呼吁对抗性机器学习社区更多地参与这些新兴方向的研究。