We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior approaches attempt to employ a single reconstruction-targeted Vector Quantization (VQ) encoder to unify these two tasks. We observe, however, that understanding and generation require fundamentally different granularities of visual information, which leads to a critical trade-off that particularly compromises performance in multimodal understanding. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic- and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. Through shared indices, this design enables direct access both to the high-level semantic representations crucial for understanding and to the fine-grained visual features essential for generation. Extensive experiments demonstrate TokenFlow's superiority across multiple dimensions. Leveraging TokenFlow, we demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, achieving a 7.2\% average improvement. For image reconstruction, we achieve a strong FID score of 0.63 at 384$\times$384 resolution. Moreover, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256$\times$256 resolution, achieving results comparable to SDXL.
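The shared-index idea above can be sketched as follows: each token is assigned a single codebook index by minimizing a joint distance over both the semantic and the pixel codebook, and that one index then retrieves an entry from each. This is a minimal illustrative sketch, not TokenFlow's exact formulation; the names (`sem_codebook`, `pix_codebook`, `w_pix`) and the simple weighted-sum distance are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 512                  # shared index space (codebook size); illustrative value
d_sem, d_pix = 64, 32    # semantic / pixel feature dims; illustrative values

# Two codebooks that share one index space (hypothetical stand-ins for
# TokenFlow's semantic and pixel codebooks).
sem_codebook = rng.standard_normal((K, d_sem))
pix_codebook = rng.standard_normal((K, d_pix))

def quantize(z_sem, z_pix, w_pix=1.0):
    """Pick ONE index per token by minimizing a joint distance over both
    codebooks, then look up both entries with that shared index.

    z_sem: (N, d_sem) semantic features; z_pix: (N, d_pix) pixel features.
    w_pix is an assumed weighting between the two distance terms.
    """
    d_s = ((z_sem[:, None, :] - sem_codebook[None]) ** 2).sum(-1)  # (N, K)
    d_p = ((z_pix[:, None, :] - pix_codebook[None]) ** 2).sum(-1)  # (N, K)
    idx = (d_s + w_pix * d_p).argmin(axis=1)                       # shared indices
    return idx, sem_codebook[idx], pix_codebook[idx]

# Example: quantize 4 tokens from a hypothetical image.
idx, q_sem, q_pix = quantize(rng.standard_normal((4, d_sem)),
                             rng.standard_normal((4, d_pix)))
```

Because one index addresses both codebooks, a downstream model can consume the semantic entries for understanding and the pixel entries for generation without re-encoding the image.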