As the capabilities of large language models continue to advance, so does their potential for misuse. While closed-source models typically rely on external defenses, open-weight models must primarily depend on internal safeguards to mitigate harmful behavior. Prior red-teaming research has largely focused on input-based jailbreaking and parameter-level manipulations. However, open-weight models also natively support prefilling, which allows an attacker to predefine initial response tokens before generation begins. Despite its potential, this attack vector has received little systematic attention. We present the largest empirical study to date of prefill attacks, evaluating over 20 existing and novel strategies across multiple model families and state-of-the-art open-weight models. Our results show that prefill attacks are consistently effective against all major contemporary open-weight models, revealing a critical and previously underexplored vulnerability with significant implications for deployment. While certain large reasoning models exhibit some robustness against generic prefilling, they remain vulnerable to tailored, model-specific strategies. Our findings underscore the urgent need for model developers to prioritize defenses against prefill attacks in open-weight LLMs.
翻译:随着大型语言模型能力的持续进步,其被滥用的可能性也相应增加。闭源模型通常依赖外部防御机制,而开源权重模型则主要依靠内部安全措施来减轻有害行为。先前的红队研究主要集中在基于输入的越狱和参数级操控上。然而,开源权重模型本身也支持预填充功能,这使得攻击者能够在生成开始前预先定义初始响应令牌。尽管存在潜在风险,但这一攻击向量尚未得到系统的关注。我们提出了迄今为止规模最大的预填充攻击实证研究,在多个模型系列和先进的开源权重模型上评估了超过20种现有及新颖的攻击策略。我们的研究结果表明,预填充攻击对所有主流的当代开源权重模型均具有持续的有效性,揭示了一个关键且先前未被充分探索的漏洞,这对模型部署具有重大影响。尽管某些大型推理模型对通用预填充表现出一定的鲁棒性,但它们仍然容易受到针对特定模型定制的策略的攻击。我们的发现强调了模型开发者迫切需要优先考虑针对开源大语言模型中预填充攻击的防御措施。