Deep generative models are known to produce undesirable samples such as harmful content. Traditional mitigation methods include re-training from scratch, filtering, or editing; however, these are either computationally expensive or can be circumvented by third parties. In this paper, we take a different approach and study how to post-edit an already-trained conditional generative model so that it redacts certain conditionals that will, with high probability, lead to undesirable content. We do this by distilling the model's conditioning network, which yields a solution that is effective, efficient, controllable, and universal for a class of deep generative models. We conduct experiments on redacting prompts in text-to-image models and redacting voices in text-to-speech models. Our method is computationally light and achieves better redaction quality and robustness than baseline methods while retaining high generation quality.
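To make the idea of distilling a conditioning network concrete, the following is a minimal toy sketch and not the method as implemented in the paper. It assumes a linear conditioning network over one-hot conditions, a squared-error distillation loss, and a redaction rule that maps each redacted condition onto the embedding of a chosen reference condition. The names `embed` and `distill`, the loss, and all numbers are hypothetical choices made for illustration.

```python
# Toy sketch under stated assumptions; all names and values are hypothetical.

def embed(weights, cond):
    """Linear conditioning network: returns weights @ cond as a list."""
    return [sum(w * c for w, c in zip(row, cond)) for row in weights]


def distill(teacher, conditions, redacted, reference, steps=60, lr=0.5):
    """Fit a student conditioning network to redaction targets.

    Retained conditions keep the teacher's embedding as their target;
    redacted ones are pointed at the teacher's embedding of `reference`.
    """
    student = [row[:] for row in teacher]  # start from the trained weights
    ref_embedding = embed(teacher, conditions[reference])
    targets = [
        ref_embedding if i in redacted else embed(teacher, cond)
        for i, cond in enumerate(conditions)
    ]
    for _ in range(steps):
        for cond, target in zip(conditions, targets):
            out = embed(student, cond)
            # Gradient step on 0.5 * ||out - target||^2 w.r.t. student[d][k]
            for d, (o, t) in enumerate(zip(out, target)):
                for k, c in enumerate(cond):
                    student[d][k] -= lr * (o - t) * c
    return student


# Three one-hot conditions, two-dimensional embeddings (made-up numbers).
conditions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
teacher = [[1.0, 0.0, -1.0], [0.5, 2.0, 3.0]]

# Redact condition 2 by mapping it onto reference condition 0.
student = distill(teacher, conditions, redacted={2}, reference=0)
```

In this sketch only the conditioning weights change. The generator that consumes the embeddings is assumed to stay frozen, which is one way to read why such an edit could be computationally light.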