Current text-conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight training-free framework that receives two separate textual prompts (one specifying a blend object, the other defining a target style) and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors. Cross-Attention Object Fusion (CAOF) first averages head-wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy-regularised optimal transport problem that reassigns complete multi-head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD-XL), preserving rich cross-head correlations while keeping memory overhead low. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization. A lightweight one-dimensional Gaussian filter separates low- and high-frequency components; only the high-frequency residual is blended back, imprinting brush-stroke-level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context-aware texture modulation that remains independent of object fusion. Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
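The CAOF step can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`sinkhorn`, `caof_sketch`), the cosine-style cost, the top-k token selection, and all hyperparameters are illustrative assumptions; only the overall recipe (head-averaged attention to locate responsive tokens, then entropy-regularised optimal transport that reassigns full multi-head feature vectors, e.g. 640-d in SD-XL) follows the description above.

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iters=50):
    """Entropy-regularised optimal transport (Sinkhorn) with uniform marginals."""
    n, m = cost.shape
    K = np.exp(-cost / eps)                       # Gibbs kernel
    a = np.full(n, 1.0 / n)                       # source marginal
    b = np.full(m, 1.0 / m)                       # target marginal
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]            # transport plan, sums to 1

def caof_sketch(attn, feats_obj, feats_style, top_k=16):
    """Toy Cross-Attention Object Fusion.

    attn:        (heads, tokens, prompt_len) cross-attention maps
    feats_obj:   (tokens, D) full multi-head features, assumed unit-norm rows
    feats_style: (tokens, D) features from the second prompt's branch
    """
    saliency = attn.mean(axis=0).max(axis=-1)     # head-averaged response per token
    idx = np.argsort(saliency)[-top_k:]           # strongly responding positions
    cost = 1.0 - feats_obj[idx] @ feats_style[idx].T  # cosine-like transport cost
    plan = sinkhorn(cost)
    plan = plan / plan.sum(axis=1, keepdims=True) # rows -> barycentric weights
    out = feats_obj.copy()
    out[idx] = plan @ feats_style[idx]            # reassign complete D-dim vectors
    return out
```

Because whole feature rows are transported (rather than per-head slices), cross-head correlations within each token's D-dimensional vector survive the reassignment, which is the property the abstract attributes to CAOF.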
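The SASF side can be sketched in the same spirit. Again, this is a hedged toy, not the paper's code: `detail_sensitive_in` approximates Detail-Sensitive Instance Normalization with a plain AdaIN-style moment transfer plus a Gaussian high-pass residual, and `sasf_attention` shows only the Key/Value swap; `sigma`, `alpha`, and all shapes are illustrative assumptions.

```python
import numpy as np

def gaussian_blur_1d(x, sigma=2.0):
    """One-dimensional Gaussian low-pass along the token axis of (tokens, C) x."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t**2 / (2 * sigma**2))
    k /= k.sum()
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, x)

def detail_sensitive_in(content, style, alpha=0.6):
    """Toy DSIN: transfer channel statistics, then add only the high-freq residual."""
    mu_c, sd_c = content.mean(0), content.std(0) + 1e-6
    mu_s, sd_s = style.mean(0), style.std(0) + 1e-6
    normed = (content - mu_c) / sd_c * sd_s + mu_s   # instance-norm style transfer
    high = style - gaussian_blur_1d(style)           # high-frequency residual only
    return normed + alpha * high                     # texture, not global geometry

def sasf_attention(q, k_style, v_style):
    """Self-attention with Key/Value swapped for the style branch's matrices."""
    scale = q.shape[-1] ** -0.5
    logits = (q @ k_style.T) * scale
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                    # softmax over style tokens
    return w @ v_style
```

Keeping the Query from the content branch while swapping Key and Value is what makes the texture modulation context-aware yet decoupled from the object-fusion pathway, matching the division of labour described above.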