We introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment. This problem is comprised of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete. Even in the simple setting of 1-digit addition problems, it is easy to find adversarial prompts that make all tested models (including PaLM2, GPT4, Claude2) misbehave, and even to steer models to a particular wrong answer. We additionally provide a simple algorithm for finding successful attacks by querying those same models, which we name "prompt inversion rejection sampling" (PIRS). We finally show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops. However, we were not able to make a language model fully robust against adversarial arithmetic attacks.
翻译:我们提出并研究算术对抗攻击问题,该问题为语言模型对齐提供了一个简单但具有挑战性的测试平台。该问题包含以自然语言形式呈现的算术题,并在问题表述完成前插入任意对抗性字符串。即使在1位数加法这种简单场景中,也很容易找到使所有被测模型(包括PaLM2、GPT4、Claude2)产生异常行为的对抗提示,甚至能将模型引导至特定错误答案。此外,我们提供了一种通过查询相同模型来发现成功攻击的简单算法,命名为"提示逆向拒绝采样"(PIRS)。最后,我们证明通过强化学习和智能体宪法循环可在一定程度上增强模型对这类攻击的防御能力。但当前尚未能使语言模型完全抵御算术对抗攻击。