We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction while remaining aligned, in both structure and latent distribution, with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem: it learns a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLM vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a Visual Vocabulary resampler, which compresses visual data into compact token sequences, with each token represented as a soft categorical distribution over the LLM vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive Rectified-Flow decoder, which employs a masked transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, we incorporate an autoregressive rectified-flow sampling strategy, which supports flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing LLMs. Project repository: https://github.com/zhangguiwei610/V2Flow
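
To make the two ideas above concrete, the following is a minimal sketch and not the V2Flow implementation. It assumes that the soft categorical distribution comes from a softmax over per-token logits, that a token's embedding is the probability-weighted average of the LLM vocabulary embeddings, and that the rectified-flow objective uses the standard straight-line interpolation between a Gaussian sample and the data. The abstract confirms none of these details, and all function names (`soft_token_embedding`, `rectified_flow_pair`, `euler_sample`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a categorical distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token_embedding(logits, vocab_embeddings):
    """Assumed form of a 'soft' visual token: a probability-weighted
    average of the rows of the LLM vocabulary embedding table."""
    probs = softmax(logits)
    dim = len(vocab_embeddings[0])
    return [
        sum(p * row[d] for p, row in zip(probs, vocab_embeddings))
        for d in range(dim)
    ]


def rectified_flow_pair(x1, rng):
    """Build one training example for a rectified-flow velocity field.

    x1 is a data sample (a flat list of floats). Returns the prior sample x0,
    the time t, the interpolated point xt, and the regression target x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]        # standard normal prior
    t = rng.random()                               # time in [0, 1)
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]     # constant along the path
    return x0, t, xt, velocity


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real system the velocity function would be a learned network conditioned on the refined token embeddings; here it is left as a plain callable so the sampling loop can be checked in isolation.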