Vision-language models (VLMs) often inherit the biases and unsafe associations present in their large-scale training data. While recent approaches mitigate unsafe behaviors, their evaluation focuses on how safe the model is on unsafe inputs, ignoring potential shortcomings on safe ones. In this paper, we first revisit safety evaluation by introducing SafeGround, a new set of metrics that evaluate safety at different levels of granularity. With these metrics, we uncover a surprising issue with training-based methods: they make the model less safe on safe inputs. Motivated by this finding, we take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the parameters most important for processing the latter; their values are then negated. Experiments show that UWM achieves the best tradeoff between safety and knowledge preservation, consistently improving VLMs on unsafe queries while outperforming even training-based state-of-the-art methods on safe ones.
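The abstract does not give the exact scoring rule, so the following is only a minimal sketch of the general idea behind UWM: score each weight of a layer by an activation-weighted importance on safe versus unsafe calibration inputs, then negate the weights that are disproportionately important for unsafe content. The function name, the `|w| * mean|a|` importance proxy, and the selection ratio are all illustrative assumptions, not the authors' published procedure.

```python
import numpy as np

def unsafe_weight_negation(W, acts_safe, acts_unsafe, ratio=0.01):
    """Illustrative sketch (not the paper's exact method).

    W:           (out, in) weight matrix of one linear layer.
    acts_safe:   (n_s, in) input activations collected on safe calibration data.
    acts_unsafe: (n_u, in) input activations collected on unsafe calibration data.
    ratio:       fraction of weights to manipulate (assumed hyperparameter).
    """
    # Proxy importance of each weight: |w_ij| scaled by the mean
    # absolute activation of its input unit on each calibration set.
    imp_safe = np.abs(W) * np.abs(acts_safe).mean(axis=0)      # (out, in)
    imp_unsafe = np.abs(W) * np.abs(acts_unsafe).mean(axis=0)  # (out, in)

    # Weights with high unsafe-minus-safe importance are treated as
    # the ones most responsible for processing unsafe content.
    score = imp_unsafe - imp_safe
    k = max(1, int(ratio * W.size))
    idx = np.unravel_index(np.argsort(score, axis=None)[-k:], W.shape)

    # Manipulate the selected weights via negation; all others are untouched.
    W_new = W.copy()
    W_new[idx] = -W_new[idx]
    return W_new, idx
```

Because only a small fraction of weights is flipped and the rest of the model is untouched, a rule of this shape requires no gradient updates, which is consistent with the training-free positioning of the method.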