The rapid advancement of large language models (LLMs) has sparked growing interest in understanding their security vulnerabilities, particularly Trojan attacks that enable stealthy manipulation of model behavior. Traditional Trojan methods typically alter inputs and/or model weights, relying on white-box assumptions that require access to data or to the model's internal parameters. In this work, we present CacheTrap, the first gray-box Trojan attack targeting the Key-Value (KV) cache of LLMs. The attack induces a single-bit flip in the KV cache, which serves as a transient trigger. When activated, this trigger causes the model to exhibit targeted behavior without any change to inputs or model weights. CacheTrap introduces an efficient search algorithm that locates vulnerable positions in the KV cache without requiring access to model weights or datasets. Extensive experiments on five open-source LLMs show a 100% attack success rate when the trigger is present, while benign accuracy is preserved when it is absent, all by flipping just one bit in the KV cache.
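To give a sense of why a single flipped bit can act as a trigger, the following minimal sketch shows how much one bit can change a floating-point value. It assumes, purely for illustration, that a cached entry is stored in IEEE 754 half precision (fp16); the abstract does not state the storage format. The helper `flip_bit_fp16` is a hypothetical name, and the sketch does not reproduce CacheTrap's search algorithm.

```python
import struct


def flip_bit_fp16(value: float, bit: int) -> float:
    """Flip one bit of the IEEE 754 half-precision encoding of `value`.

    Bit 0 is the least significant mantissa bit, bits 10-14 are the
    exponent, and bit 15 is the sign. The fp16 storage format is an
    assumption made for illustration only.
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit must be in the range 0..15")
    # Reinterpret the half-precision float as a 16-bit unsigned integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    # XOR with a one-hot mask flips exactly one bit.
    flipped_raw = raw ^ (1 << bit)
    # Reinterpret the modified bit pattern as a half-precision float.
    (flipped,) = struct.unpack("<e", struct.pack("<H", flipped_raw))
    return flipped


if __name__ == "__main__":
    original = 0.5  # hypothetical cached value
    for bit in (0, 14, 15):
        print(f"bit {bit:2d}: {original} -> {flip_bit_fp16(original, bit)}")
```

In this sketch, flipping the lowest mantissa bit barely changes the value, while flipping the highest exponent bit changes it by several orders of magnitude. That asymmetry is one plausible reason why the position of the flipped bit matters, and why a search over positions is needed.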