Model editing methods modify specific behaviors of Large Language Models by altering a small, targeted set of network weights, and require very little data and compute. These methods can be used for malicious applications such as inserting misinformation or simple trojans that result in adversary-specified behaviors when a trigger word is present. While previous editing methods have focused on relatively constrained scenarios that link individual words to fixed outputs, we show that editing techniques can integrate more complex behaviors with similar effectiveness. We develop Concept-ROT, a model editing-based method that efficiently inserts trojans which not only exhibit complex output behaviors, but also trigger on high-level concepts -- presenting an entirely new class of trojan attacks. Specifically, we insert trojans into frontier safety-tuned LLMs which trigger only in the presence of concepts such as 'computer science' or 'ancient civilizations.' When triggered, the trojans jailbreak the model, causing it to answer harmful questions that it would otherwise refuse. Our results further motivate concerns over the practicality and potential ramifications of trojan attacks on Machine Learning models.