The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches typically either enforce both supervision signals on the same set of representations, causing interference, or decouple them into separate feature spaces, causing inconsistency. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory in which earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256×256 resolution. When integrated with a large language model, EvoTok shows promising performance on 7 out of 9 visual understanding benchmarks and strong results on image generation benchmarks such as GenEval and GenAI-Bench. These results demonstrate that modeling visual representations as an evolving trajectory provides an effective and principled solution for unifying visual understanding and generation.
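The cascaded residual encoding described above can be illustrated with a minimal sketch of generic residual vector quantization: each stage quantizes the residual left by the previous stage, so the token sequence refines the representation progressively. The function name, codebook sizes, and dimensions below are illustrative assumptions, not EvoTok's actual architecture or configuration.

```python
# Minimal sketch of residual vector quantization (RVQ), the mechanism the
# abstract attributes to EvoTok. All names, depths, and sizes here are
# illustrative assumptions, not EvoTok's actual design.
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize vector x through a cascade of codebooks.

    Each stage picks the codeword nearest to the current residual, adds
    it to the running reconstruction, and passes the leftover residual
    to the next stage. Returns the chosen indices and the reconstruction.
    """
    residual = x.astype(np.float64)
    recon = np.zeros_like(residual)
    indices = []
    for cb in codebooks:                      # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        k = int(np.argmin(dists))             # nearest codeword
        indices.append(k)
        recon = recon + cb[k]
        residual = residual - cb[k]           # leftover for the next stage
    return indices, recon

rng = np.random.default_rng(0)
x = rng.normal(size=4)                        # toy latent vector
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]  # 3-stage cascade
idx, recon = rvq_encode(x, codebooks)
```

Each image token sequence in this scheme is simply the list of per-stage indices; the claim in the abstract is that, with appropriate training, early stages end up carrying pixel-level detail while later stages drift toward semantics.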