Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes because they suppress the selected neurons entirely. The damage arises because sensitive and benign semantics exhibit non-orthogonal superposition: they share activation subspaces in which their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages a sparse autoencoder (SAE) to achieve high-resolution feature disentanglement and then redefines erasure as an analytical orthogonal projection that leaves the benign manifold invariant. OrthoEraser first employs the SAE to decompose dense activations and isolate sensitive neurons. It then uses coupled neuron detection to identify non-sensitive features that are vulnerable to the intervention. The key novelty is an analytical gradient orthogonalization strategy that projects erasure vectors onto the null space of the coupled neurons. This projection decouples the sensitive concepts from the identified critical benign subspace, effectively preserving non-sensitive semantics. Safety experiments demonstrate that OrthoEraser achieves high erasure precision, removing harmful content while preserving the integrity of the generative manifold, and significantly outperforms SOTA baselines. This paper contains results from unsafe models.
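As a rough illustration of the null-space projection step described above, the following minimal sketch removes from an erasure vector every component that lies in the span of a set of coupled directions. The abstract does not give the exact formulation, so this assumes the standard orthogonal projector onto the complement of that span, and it assumes the coupled neurons can be represented as plain direction vectors (for example, SAE decoder directions). The function names `orthonormal_basis` and `project_onto_null_space` and the toy vectors are hypothetical and are not taken from the paper.

```python
from math import sqrt


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`.

    Directions that are (numerically) linearly dependent on earlier
    ones are dropped.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_onto_null_space(erasure, coupled):
    """Project `erasure` onto the orthogonal complement of span(coupled).

    Assumption: this is the usual projector I - Q^T Q, where the rows of Q
    are an orthonormal basis of the coupled directions. The result has zero
    inner product with every coupled direction.
    """
    out = list(erasure)
    for b in orthonormal_basis(coupled):
        coeff = dot(out, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out


# Toy example (hypothetical values): two coupled benign directions spanning
# the x-y plane, and an erasure vector with components inside and outside it.
coupled = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
erasure = [0.5, -2.0, 3.0]
projected = project_onto_null_space(erasure, coupled)
```

In this toy case only the third component of the erasure vector survives, so applying the projected vector would not move activations along either coupled direction. That is the sense in which a projection of this kind leaves the protected benign subspace untouched.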