Recent advances in diffusion-based controllable visual generation have substantially improved image quality. However, these models are typically deployed on cloud servers because of their large computational demands, which raises serious concerns about user data privacy. To enable secure and efficient on-device generation, this paper explores controllable diffusion models built on linear-attention architectures, which scale efficiently even on edge devices. Our experiments reveal, however, that existing controllable generation frameworks such as ControlNet and OminiControl either lack the flexibility to support multiple heterogeneous condition types or converge slowly on linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored to linear-attention backbones such as SANA. The core of our method is a unified gated conditioning module operating in a dual-path pipeline, which integrates multiple types of conditional input, including both spatially aligned and non-aligned cues. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation on linear-attention models, surpassing existing methods in both fidelity and controllability.