Current research on operator control of Large Language Models improves model robustness against adversarial attacks and misbehavior by training on preference examples, prompting, and input/output filtering. Despite good results, LLMs remain susceptible to abuse, and jailbreak probability increases with context length. There is a need for robust LLM security guarantees in long-context situations. We propose control sentences inserted into the LLM context as invasive context engineering to partially solve the problem. We suggest this technique can be generalized to the Chain-of-Thought process to prevent scheming. Invasive Context Engineering does not rely on LLM training, avoiding data shortage pitfalls which arise in training models for long context situations.
翻译:当前关于大型语言模型操作控制的研究通过偏好示例训练、提示工程及输入/输出过滤来提升模型对抗攻击与不当行为的鲁棒性。尽管取得良好效果,LLM仍易受滥用,且越狱概率随上下文长度增加而上升。长上下文场景亟需可靠的LLM安全保证机制。本文提出将控制语句作为侵入式上下文工程嵌入LLM上下文,以部分解决该问题。我们认为该技术可推广至思维链过程以防范策略性欺骗。侵入式上下文工程不依赖LLM训练,避免了长上下文场景下模型训练面临的数据短缺困境。