Text-to-image generation with visual autoregressive~(VAR) models has recently achieved impressive advances in generation fidelity and inference efficiency. While control mechanisms have been explored extensively for diffusion models, enabling precise and flexible control within the VAR paradigm remains underexplored. To bridge this gap, we introduce ScaleWeaver, a novel framework designed to achieve high-fidelity, controllable generation on top of advanced VAR models through parameter-efficient fine-tuning. The core of ScaleWeaver is an improved MMDiT block equipped with the proposed Reference Attention module, which incorporates conditional information efficiently and effectively. Unlike standard MM Attention, Reference Attention discards the unnecessary attention from image$\rightarrow$condition, reducing computational cost while stabilizing control injection. Moreover, it strategically emphasizes parameter reuse, leveraging the capability of the VAR backbone itself, with only a few introduced parameters, to process control information, and employs a zero-initialized linear projection to ensure that control signals are incorporated effectively without disrupting the generative capability of the base model. Extensive experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods, making it a practical and effective solution for controllable text-to-image generation within the visual autoregressive paradigm. Code and models will be released.
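The zero-initialized projection mentioned above can be illustrated with a minimal NumPy sketch. This is a hedged illustration only, assuming a simple residual injection of projected control features into image features; the function name `inject_control` and all shapes are hypothetical, and the actual ScaleWeaver layer may differ. The point shown is that a zero-initialized weight matrix makes the control branch a no-op at the start of fine-tuning, so the base model's behavior is preserved.

```python
import numpy as np

# Hypothetical residual injection of control features (names/shapes assumed,
# not taken from the paper).
def inject_control(img_feats: np.ndarray, ctrl_feats: np.ndarray,
                   w_proj: np.ndarray) -> np.ndarray:
    """Add linearly projected control features to image features."""
    return img_feats + ctrl_feats @ w_proj

d = 8
rng = np.random.default_rng(0)
img = rng.standard_normal((4, d))   # image-token features
ctrl = rng.standard_normal((4, d))  # condition-token features

# Zero-initialized projection: the control branch contributes nothing at
# initialization, so the output equals the base model's features.
w_zero = np.zeros((d, d))
out = inject_control(img, ctrl, w_zero)
assert np.allclose(out, img)
```

As the projection weights are updated during fine-tuning, the control contribution grows gradually from zero, which is one common way such schemes avoid disrupting a pretrained backbone.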