Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach for pushing output features away from undesired concepts. While useful, adversarial guidance with text alone can be insufficient to capture complex visual concepts and avoid undesired visual elements such as copyrighted characters. In this paper, we explore, for the first time, an alternate modality in this direction: performing adversarial guidance directly using visual features from a reference image or from other images in a batch. In particular, we introduce negative token merging (NegToMe), a simple but effective training-free approach that performs adversarial guidance by selectively pushing apart matching semantic features (between the reference and the output generation) during the reverse diffusion process. When applied w.r.t. other images in the same batch, NegToMe significantly increases output diversity (racial, gender, visual) without sacrificing output image quality. Similarly, when applied w.r.t. a reference copyrighted asset, NegToMe reduces visual similarity with copyrighted content by 34.57%. NegToMe is simple to implement with just a few lines of code, increases inference time only marginally (<4%), and generalizes to diffusion architectures such as Flux that do not natively support a separate negative prompt. Code is available at https://negtome.github.io
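The core operation described above (matching semantic features between the output and a reference, then pushing matched features apart) can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: the function name, the cosine-similarity matching, and the `alpha`/`thresh` parameters are assumptions, and the real method operates on diffusion-model token features during the reverse process rather than on plain arrays.

```python
import numpy as np

def negative_token_merge(out_tokens, ref_tokens, alpha=0.1, thresh=0.5):
    """Hypothetical sketch of negative token merging.

    out_tokens: (n_out, d) output feature tokens at the current step.
    ref_tokens: (n_ref, d) tokens from the reference image (or batch neighbor).
    For each output token, find its best-matching reference token by cosine
    similarity; if the match is strong enough, push the output token away
    from the matched reference feature.
    """
    o = out_tokens / np.linalg.norm(out_tokens, axis=-1, keepdims=True)
    r = ref_tokens / np.linalg.norm(ref_tokens, axis=-1, keepdims=True)
    sim = o @ r.T                      # (n_out, n_ref) cosine similarities
    best = sim.argmax(axis=1)          # index of best-matching reference token
    best_sim = sim.max(axis=1)
    mask = best_sim > thresh           # only adjust sufficiently similar matches
    adjusted = out_tokens.copy()
    # push matched output tokens away from their reference counterparts
    adjusted[mask] -= alpha * ref_tokens[best[mask]]
    return adjusted
```

In this toy version, tokens with no close semantic match in the reference are left untouched, which mirrors the "selective" aspect the abstract emphasizes: only matching features are pushed apart, so overall image quality is preserved.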