We introduce the Representation Tokenizer (RepTok), a generative modeling framework that represents an image with a single continuous latent token obtained from self-supervised (SSL) vision transformers. Starting from a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly under a standard flow matching objective. This adaptation enriches the token with the low-level, reconstruction-relevant details needed for faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, keeping the latent space smooth and suitable for generation. Our single-token formulation eliminates the spatial redundancy of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and extends naturally to text-to-image synthesis, remaining competitive in zero-shot evaluation on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
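As a hypothetical illustration of the training objective sketched above, the joint loss might combine a standard conditional flow matching term with the cosine-similarity regularizer; the notation below ($v_\theta$ for the decoder's velocity field, $z_\phi$ for the fine-tuned token embedding, $z_{\mathrm{ssl}}$ for the frozen pre-trained token, $\lambda$ for a weighting coefficient) is our assumption, not taken from the paper:

$$
\mathcal{L}(\theta, \phi) \;=\; \mathbb{E}_{x,\; \epsilon \sim \mathcal{N}(0, I),\; t \sim \mathcal{U}[0,1]} \Big[ \big\| v_\theta\big(x_t,\, t,\, z_\phi(x)\big) - (x - \epsilon) \big\|_2^2 \Big] \;+\; \lambda \Big( 1 - \cos\big(z_\phi(x),\, z_{\mathrm{ssl}}(x)\big) \Big), \qquad x_t = (1-t)\,\epsilon + t\,x.
$$

Here the first term is the usual flow matching loss regressing the constant velocity $x - \epsilon$ along the linear interpolation path $x_t$, conditioned on the single latent token, while the second term keeps the adapted token directionally close to the original SSL embedding, preserving the geometry of the pre-trained space.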