Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, preventing principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resources on unimportant regions. We introduce the Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations, with earlier latents capturing global structure and later ones refining details. At inference, the number of latents can be adjusted dynamically to match compute constraints. ELIT is deliberately minimal, adding only two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K at 512px, ELIT achieves average improvements of $35.3\%$ in FID and $39.6\%$ in FDD. Project page: https://snap-research.github.io/elit/
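The Read/Write interface and tail-latent dropping described above can be illustrated with a minimal, single-head NumPy sketch. This is an assumption-laden toy, not the paper's implementation: `elit_step`, `n_keep`, and the unprojected single-head attention are all simplifications for clarity, and the DiT blocks that would process the latents are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head cross-attention without learned projections:
    # each query token attends over the key/value sequence.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

def elit_step(spatial_tokens, latents, n_keep):
    # Read: latents gather information from the spatial tokens.
    z = cross_attention(latents, spatial_tokens)
    # (A real model would run the standard DiT blocks on z here.)
    # Tail dropping: keep only the first n_keep latents, so earlier
    # latents are forced to carry the most important information.
    z = z[:n_keep]
    # Write: spatial tokens read back from the truncated latents.
    return cross_attention(spatial_tokens, z)

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 32))   # 256 spatial tokens, width 32
z0 = rng.standard_normal((64, 32))   # 64 learnable latent tokens
out = elit_step(x, z0, n_keep=16)    # inference with only 16 latents
print(out.shape)                     # spatial shape is preserved: (256, 32)
```

Because the transformer blocks run on the latents rather than the spatial grid, shrinking `n_keep` cuts the dominant compute while the Write layer still produces a full-resolution output.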