We study how to subvert large language models (LLMs) from following prompt-specified rules. We model rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form ``if $P$ and $Q$, then $R$'' for some propositions $P$, $Q$, and $R$. We prove that although LLMs can faithfully follow such rules, maliciously crafted prompts can mislead even idealized, theoretically constructed models. Empirically, we find that the reasoning behavior of LLMs aligns with that of our theoretical constructions, and popular attack algorithms find adversarial prompts with characteristics predicted by our theory. Our logic-based framework provides a novel perspective for mechanistically understanding the behavior of LLMs in rule-based settings such as jailbreak attacks.
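The Horn-logic rule format above ("if $P$ and $Q$, then $R$") can be sketched as simple forward chaining; this is an illustrative example of propositional Horn inference in general, not the paper's specific model, and the rule and fact names are made up for demonstration.

```python
def forward_chain(facts, rules):
    """Repeatedly fire Horn rules (body -> head) until no new fact is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            # A rule fires when every proposition in its body is already derived.
            if head not in derived and set(body) <= derived:
                derived.add(head)
                changed = True
    return derived

# "If P and Q, then R" and "If R, then S", starting from facts P and Q.
rules = [({"P", "Q"}, "R"), ({"R"}, "S")]
facts = {"P", "Q"}
print(sorted(forward_chain(facts, rules)))  # ['P', 'Q', 'R', 'S']
```

A maliciously crafted prompt, in this framing, corresponds to input that causes the model's inference to diverge from the set this procedure would compute.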