Despite the state-of-the-art performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and their tendency to generate unsafe content remain major obstacles to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and supported by rigorous theory. We introduce BarrierSteer, a novel framework that formalizes response safety by embedding learned non-linear safety constraints directly into the model's latent representation space. BarrierSteer employs a steering mechanism based on Control Barrier Functions (CBFs) to detect unsafe response trajectories during inference and redirect them with high precision at low computational cost. Because it enforces multiple safety constraints through efficient constraint merging without modifying the underlying LLM's parameters, BarrierSteer preserves the model's original capabilities and performance. We provide theoretical results establishing that applying CBFs in latent space is a principled and computationally efficient approach to enforcing safety. Experiments across multiple models and datasets show that BarrierSteer substantially reduces adversarial attack success rates, decreases the frequency of unsafe generations, and outperforms existing methods.
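As a concrete illustration of the steering idea (a minimal sketch, not the paper's implementation): in discrete time, a CBF condition of the form h(x_{t+1}) >= (1 - gamma) * h(x_t), where h is a learned barrier function that is positive on the safe region of latent space, can be enforced by minimally correcting the hidden state whenever the condition would otherwise be violated. All names below (barrier_net, gamma, the gradient-ascent update rule) are illustrative assumptions rather than details taken from the paper.

```python
import torch

def cbf_steer(hidden: torch.Tensor,
              barrier_net: torch.nn.Module,
              gamma: float = 0.1,
              step_size: float = 0.5,
              max_iters: int = 10) -> torch.Tensor:
    """Sketch of a discrete-time CBF safety filter on an LLM hidden state.

    Returns a minimally modified hidden state satisfying
    h(steered) >= (1 - gamma) * h(hidden), where barrier_net (a hypothetical
    learned barrier h) is positive on the safe region of latent space.
    """
    x = hidden.detach()
    # Barrier value the steered state must reach (decay toward the safe-set
    # boundary when already safe; climb back toward it when unsafe).
    threshold = ((1.0 - gamma) * barrier_net(x)).detach()
    steered = x.clone().requires_grad_(True)
    for _ in range(max_iters):
        h = barrier_net(steered)
        if (h >= threshold).all():
            break  # CBF condition holds: leave the state essentially untouched
        # Gradient ascent on the barrier value: a small local correction that
        # pushes the latent state back toward the safe set.
        grad, = torch.autograd.grad(h.sum(), steered)
        with torch.no_grad():
            steered += step_size * grad
    return steered.detach()
```

In a full system a correction like this would be applied at each decoding step to the hidden states of selected layers; the iterative gradient loop here merely stands in for whatever closed-form or optimization-based update the actual method uses.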