LINA: Linear Autoregressive Image Generative Models with Continuous Tokens

Autoregressive models with continuous tokens are a promising paradigm for visual generation, especially text-to-image (T2I) synthesis, but they are computationally expensive. We study how to design compute-efficient linear attention within this framework. Specifically, we conduct a systematic empirical analysis of scaling behavior with respect to parameter count under different design choices, focusing on (1) the normalization paradigm in linear attention (division-based vs. subtraction-based) and (2) depthwise convolution for locality augmentation. Our results show that although subtraction-based normalization is effective for image classification, division-based normalization scales better for linear generative transformers. In addition, incorporating convolution for locality modeling plays a crucial role in autoregressive generation, consistent with findings in diffusion models. We further extend gating mechanisms, commonly used in causal linear attention, to the bidirectional setting and propose a KV gate. By applying data-independent learnable parameters to the key and value states, the KV gate assigns token-wise memory weights, enabling flexible memory management similar to forget gates in language models. Based on these findings, we present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024×1024 images from user instructions. LINA achieves competitive performance on both class-conditional and T2I benchmarks, obtaining an FID of 2.18 on ImageNet (about 1.4B parameters) and a score of 0.74 on GenEval (about 1.5B parameters). A single linear attention module reduces FLOPs by about 61% compared to softmax attention. Code and models are available at: https://github.com/techmonsterwang/LINA.
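
To make the division-based normalization and the KV gate concrete, the following is a minimal pure-Python sketch of bidirectional linear attention. It is not taken from the LINA code base. The feature map `elu(x) + 1`, the function names, the `eps` stabilizer, and the exact form of the gate (one scalar per token for keys and one for values, here passed in as plain lists named `key_gate` and `value_gate`) are all assumptions made for illustration. The abstract states only that the gate uses data-independent learnable parameters on the key and value states.

```python
import math


def feature_map(x):
    """Positive feature map phi(x) = elu(x) + 1 (an assumed, common choice)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(q, k, v, key_gate=None, value_gate=None, eps=1e-6):
    """Bidirectional linear attention with division-based normalization.

    q, k: lists of N vectors of length d. v: list of N vectors of length d_v.
    key_gate, value_gate: optional per-token scalars. They stand in for a
    hypothetical KV gate and do not depend on the token content.
    """
    n, d, d_v = len(k), len(k[0]), len(v[0])
    key_gate = [1.0] * n if key_gate is None else key_gate
    value_gate = [1.0] * n if value_gate is None else value_gate

    # Build the shared states once, in O(N * d * d_v):
    #   S = sum_j (gk_j * phi(k_j)) (gv_j * v_j)^T,   z = sum_j gk_j * phi(k_j)
    S = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    for j in range(n):
        gated_key = [key_gate[j] * x for x in feature_map(k[j])]
        gated_value = [value_gate[j] * x for x in v[j]]
        for a in range(d):
            z[a] += gated_key[a]
            for b in range(d_v):
                S[a][b] += gated_key[a] * gated_value[b]

    # Each query reads from the states; the division normalizes the weights.
    out = []
    for qi in q:
        pq = feature_map(qi)
        denom = sum(pq[a] * z[a] for a in range(d)) + eps
        out.append(
            [sum(pq[a] * S[a][b] for a in range(d)) / denom for b in range(d_v)]
        )
    return out


def quadratic_reference(q, k, v, eps=1e-6):
    """Same ungated result computed the O(N^2) way, for comparison only."""
    out = []
    for qi in q:
        pq = feature_map(qi)
        weights = [sum(a * b for a, b in zip(pq, feature_map(kj))) for kj in k]
        denom = sum(weights) + eps
        out.append(
            [
                sum(weights[j] * v[j][b] for j in range(len(v))) / denom
                for b in range(len(v[0]))
            ]
        )
    return out
```

With all gates left at 1, `linear_attention` matches `quadratic_reference` up to floating-point error, while its cost grows linearly rather than quadratically in the number of tokens. Setting a token's `key_gate` entry to 0 removes that token from both states, which is the sense in which such a gate acts as a per-token memory weight in this sketch.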
