Visual understanding and visual generation require different representation spaces, which makes it difficult to unify them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction captures low-level visual appearance well, which suits visual generation, but it lacks the high-level semantic representations needed for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into pixel space for generation. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. Directly combining reconstruction and semantic objectives, however, creates conflicts that degrade both reconstruction fidelity and semantic accuracy. Instead of forcing a single codebook to capture both visual appearance and semantics, DualToken disentangles them by introducing separate codebooks for high-level semantics and low-level visual details. As a result, DualToken achieves 0.25 rFID and 82.0% zero-shot accuracy on ImageNet, and proves effective in downstream MLLM tasks for both understanding and generation. Specifically, our method surpasses VILA-U by 5.8 points on average across ten visual understanding benchmarks and delivers a 13% improvement on GenAI-Bench. Notably, incorporating dual visual tokens outperforms using a single token type on both understanding and generation tasks. We hope our research offers a new perspective on leveraging dual visual vocabularies for building unified vision-language models. The project page is available at https://songweii.github.io/dualtoken-project-page.
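The idea of keeping separate codebooks for low-level appearance and high-level semantics can be illustrated with a toy nearest-neighbour quantizer. This is a minimal sketch under stated assumptions, not the DualToken implementation: the names `nearest_code`, `dual_tokenize`, `pixel_codebook`, and `semantic_codebook` are hypothetical, the feature vectors and codebook entries are made-up 2-D values, and how the real features are extracted and how the codebooks are trained is not shown.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `feature` (squared L2)."""
    def sq_dist(code: Vector) -> float:
        return sum((f - c) ** 2 for f, c in zip(feature, code))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def dual_tokenize(
    low_feats: Sequence[Vector],
    high_feats: Sequence[Vector],
    pixel_codebook: Sequence[Vector],
    semantic_codebook: Sequence[Vector],
) -> Tuple[List[int], List[int]]:
    """Quantize two feature streams independently, one codebook per stream.

    Assumption for illustration: `low_feats` carry appearance detail and
    `high_feats` carry semantics. Neither stream is forced to share codes
    with the other, so each codebook can specialise.
    """
    pixel_ids = [nearest_code(f, pixel_codebook) for f in low_feats]
    semantic_ids = [nearest_code(f, semantic_codebook) for f in high_feats]
    return pixel_ids, semantic_ids


# Toy data (hypothetical values chosen only to make the example concrete).
pixel_codebook = [[0.0, 0.0], [1.0, 1.0]]
semantic_codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
low_feats = [[0.1, 0.0], [0.9, 1.2]]
high_feats = [[0.8, 0.1], [-0.7, 0.2]]

pixel_ids, semantic_ids = dual_tokenize(
    low_feats, high_feats, pixel_codebook, semantic_codebook
)
print("pixel token ids:   ", pixel_ids)
print("semantic token ids:", semantic_ids)
```

Each image position ends up with two discrete ids, one from each vocabulary. A downstream model could consume either sequence or both, which is the setting the single-token versus dual-token comparison refers to.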