Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from two problems: limited reconstruction fidelity due to the frozen encoder, which in turn degrades editing quality, and overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose the Representation-Pivoted AutoEncoder (RPiAE), a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge that compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled, stage-wise training strategy that sequentially optimizes generative-tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.
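The abstract names three ingredients (a reconstruction objective, a regularizer that keeps the fine-tuned encoder close to the pretrained representation, and a variational bottleneck) without giving their formulas. The following is a minimal sketch of how such terms might be written, not the authors' implementation. Every function name, the use of cosine distance for the pivot term, the Gaussian KL form for the bridge, and the weights are assumptions made for illustration. The abstract also says the objectives are optimized in separate stages; which terms are active in which stage is not specified there, so the weights here are simply left adjustable.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors (reconstruction term)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_distance(a, b):
    """1 - cosine similarity: close to 0 when the two feature vectors align."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, 1)), averaged over latent dimensions."""
    per_dim = (0.5 * (math.exp(lv) + m * m - 1.0 - lv) for m, lv in zip(mu, logvar))
    return sum(per_dim) / len(mu)

def tokenizer_loss(image, recon, tuned_feat, pivot_feat, mu, logvar,
                   pivot_weight=1.0, kl_weight=1e-4):
    """Hypothetical weighted sum of the three terms.

    tuned_feat: features from the encoder being fine-tuned (assumed name).
    pivot_feat: features from a frozen copy of the pretrained encoder,
                used as the "pivot" (assumed interpretation).
    Setting a weight to 0 disables that term, e.g. for a given training stage.
    """
    return (mse(image, recon)
            + pivot_weight * cosine_distance(tuned_feat, pivot_feat)
            + kl_weight * kl_to_standard_normal(mu, logvar))
```

In this sketch, the pivot term penalizes the fine-tuned encoder only for drifting in direction from the frozen features, which is one plausible way to let reconstruction improve while retaining semantic structure. Other distances (for example an L2 penalty) would fit the same description.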