Training multimodal generative models on large, uncurated datasets can expose users to harmful, unsafe, controversial, or culturally inappropriate outputs. While model editing has been proposed to remove or filter undesirable concepts in embedding and latent spaces, it can inadvertently damage learned manifolds, distorting concepts in close semantic proximity. We identify limitations in current model editing techniques, showing that even benign, proximal concepts may become misaligned. To address the need for safe content generation, we propose a modular, dynamic solution that leverages safety-context embeddings and a dual reconstruction process using tunable weighted summation in the latent space to generate safer images. Our method preserves global context without compromising the structural integrity of the learned manifolds. We achieve state-of-the-art results on safe image generation benchmarks, while offering controllable variation of model safety. We identify trade-offs between safety and censorship, which present a necessary perspective for the development of ethical AI models. We will release our code.
Keywords: Text-to-Image Models, Generative AI, Safety, Reliability, Model Editing
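The tunable weighted summation mentioned above can be illustrated with a minimal sketch. This is an illustrative assumption about the mechanism, not the paper's actual implementation: the function name `blend_latents`, the convex-combination form, and the interpretation of `alpha` as a safety strength are all hypothetical.

```python
import numpy as np

def blend_latents(z_orig: np.ndarray, z_safe: np.ndarray, alpha: float) -> np.ndarray:
    """Blend an original latent with a safety-context latent.

    alpha = 0.0 keeps the original latent unchanged;
    alpha = 1.0 fully replaces it with the safety-context latent.
    Intermediate values give the controllable safety variation
    described in the abstract (illustrative form only).
    """
    return (1.0 - alpha) * z_orig + alpha * z_safe

# Toy example: a 3-dimensional latent blended at half strength.
z_orig = np.array([1.0, -2.0, 0.5])
z_safe = np.array([0.0, 0.0, 0.0])
print(blend_latents(z_orig, z_safe, 0.5))  # → [ 0.5  -1.    0.25]
```

Because `alpha` is a continuous knob rather than a hard filter, such a scheme exposes the safety-versus-censorship trade-off directly: the degree of intervention is a user-visible parameter instead of an irreversible edit to the model's weights.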