Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline in which the pretrained autoencoder introduces lossy reconstruction, which leads to error accumulation and hinders joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 1.81 FID on ImageNet 512, surpassing existing pixel generative models. We further extend PixelDiT to text-to-image generation and pretrain it at 1024² resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models. Code: https://github.com/NVlabs/PixelDiT
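To make the dual-level idea concrete, the following is a minimal sketch of how an image can be viewed at two granularities: a short sequence of patch-level tokens, each containing its own small set of pixels. It is not the PixelDiT implementation. The helper names `patchify` and `unpatchify`, the patch size of 16, and the token-count comparison are assumptions chosen for illustration; the abstract does not state the patch size or how the two levels are wired together.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch, C). The first
    axis indexes patch-level tokens; the second indexes the pixels
    inside each patch.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    tiles = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(gh * gw, patch * patch, c)


def unpatchify(tokens: np.ndarray, height: int, width: int) -> np.ndarray:
    """Inverse of patchify: reassemble (num_patches, p * p, C) into (H, W, C)."""
    _, pixels_per_patch, c = tokens.shape
    patch = int(round(pixels_per_patch ** 0.5))
    gh, gw = height // patch, width // patch
    tiles = tokens.reshape(gh, gw, patch, patch, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(height, width, c)


# Hypothetical settings: a 256 x 256 RGB image and a patch size of 16.
SIDE, PATCH = 256, 16
image = np.random.default_rng(0).standard_normal((SIDE, SIDE, 3))
tokens = patchify(image, PATCH)

num_patch_tokens = tokens.shape[0]   # sequence length at the patch level
pixels_per_patch = tokens.shape[1]   # pixels handled within each patch
num_pixel_tokens = SIDE * SIDE       # sequence length if every pixel were a token

# Self-attention cost grows with the square of the sequence length.
attention_pair_ratio = (num_pixel_tokens ** 2) // (num_patch_tokens ** 2)
```

Under these assumed settings, the patch-level sequence has 256 tokens, while treating every pixel as a token would give 65,536, so global attention over patches involves far fewer token pairs than global attention over pixels. This is one plausible reading of why a patch-level model for global semantics, paired with a pixel-level model for local texture, can make pixel-space training tractable.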