While modern text-to-image models excel at prompt-based generation, they often lack the fine-grained control needed to satisfy specific user requirements such as spatial layouts or subject appearances. Multi-condition control addresses this gap, yet its integration into Diffusion Transformers (DiTs) is bottlenecked by the conventional ``concatenate-and-attend'' strategy, whose computational and memory overhead grows quadratically with the number of conditions. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. Motivated by this observation, we propose Position-Aligned and Keyword-Scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies. Specifically, Position-Aligned Attention (PAA) linearizes spatial control by enforcing localized patch alignment, while Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven interactions via semantic-aware masking. To facilitate efficient learning, we further introduce a Conditional Sensitivity-Aware Sampling (CSAS) strategy that reweights the training objective toward the critical denoising phases, substantially accelerating convergence and enhancing conditional fidelity. Empirically, PKA delivers a 10.0$\times$ inference speedup and a 5.1$\times$ reduction in VRAM usage, providing a scalable and resource-friendly solution for high-fidelity multi-condition generation.
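To make the masking idea concrete, the following is a minimal PyTorch sketch of keyword-scoped attention masking in the spirit of KSA: each condition's tokens are visible only to the image tokens that a keyword-level relevance map ties to them, so semantically irrelevant image-condition pairs are pruned from attention. All names here (`build_ksa_mask`, the `relevance` matrix, the span layout) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch of keyword-scoped attention masking (not the paper's code).
import torch
import torch.nn.functional as F

def build_ksa_mask(n_img: int, cond_spans: list[tuple[int, int]],
                   relevance: torch.Tensor) -> torch.Tensor:
    """Boolean mask of shape (L, L) over [image tokens | condition tokens].

    cond_spans: (start, end) token ranges of each condition, relative to
                the start of the condition segment.
    relevance:  (n_img, n_cond) boolean matrix marking which image tokens
                are semantically tied to which condition (e.g., derived
                from keyword-grounded cross-attention; assumed given here).
    """
    n_cond_tok = max(end for _, end in cond_spans)
    L = n_img + n_cond_tok
    mask = torch.zeros(L, L, dtype=torch.bool)
    mask[:n_img, :n_img] = True  # image self-attention stays dense
    for c, (s, e) in enumerate(cond_spans):
        rel = relevance[:, c]    # image tokens tied to condition c
        # relevant image queries may attend to this condition's tokens
        mask[:n_img, n_img + s:n_img + e] = rel.unsqueeze(1)
        # condition tokens attend within their own span (a simplification)
        mask[n_img + s:n_img + e, n_img + s:n_img + e] = True
    return mask

# Usage: masked-out image-condition pairs never contribute to attention.
n_img, spans = 16, [(0, 4), (4, 8)]          # two conditions, 4 tokens each
rel = torch.rand(n_img, 2) > 0.5             # toy keyword-relevance map
attn_mask = build_ksa_mask(n_img, spans, rel)
q = k = v = torch.randn(1, 8, n_img + 8, 64) # (batch, heads, L, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```

With a dense kernel this only masks logits; realizing the compute and memory savings the abstract reports would additionally require a block-sparse or masked-attention kernel that skips the pruned pairs entirely.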