Deep generative models are known to produce undesirable samples, such as harmful content. Traditional mitigation methods include re-training from scratch, filtering, and editing; however, these are either computationally expensive or can be circumvented by third parties. In this paper, we take a different approach and study how to post-edit an already-trained conditional generative model so that it redacts certain conditionals that will, with high probability, lead to undesirable content. We do this by distilling the conditioning network of the model, which gives a solution that is effective, efficient, controllable, and universal for a class of deep generative models. We conduct experiments on redacting prompts in text-to-image models and redacting voices in text-to-speech models. Our method is computationally light and achieves better redaction quality and robustness than baseline methods while retaining high generation quality.