Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in vision-language tasks. A core challenge for MLLMs is vision tokenization: efficiently transforming input visual signals into feature representations that are most beneficial to the LLM. However, existing vision tokenizers, which are essential for semantic alignment between vision and language, remain problematic: they aggressively fragment the visual input, corrupting its semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm and flexibly determines the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. Our experimental results show that the proposed MLLM (Setokim), equipped with SeTok, achieves superior performance across various tasks. The project page is at https://chocowu.github.io/SeTok-web/.
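To make the idea of content-adaptive tokenization concrete, here is a minimal sketch of a dynamic clustering step that pools patch features into a variable number of "semantic tokens." This is an illustrative assumption, not the paper's actual SeTok algorithm: the function name `dynamic_cluster`, the greedy centroid-merging scheme, and the cosine-similarity threshold are all hypothetical stand-ins for the clustering described in the abstract.

```python
import numpy as np

def dynamic_cluster(patch_feats: np.ndarray, sim_threshold: float = 0.8) -> np.ndarray:
    """Greedy one-pass clustering of patch features (hypothetical sketch).

    Each patch joins the existing cluster whose centroid it is most
    cosine-similar to, if that similarity exceeds `sim_threshold`;
    otherwise it opens a new cluster. The number of clusters (and hence
    output tokens) therefore adapts to the diversity of the input,
    mirroring the complexity-dependent token count described for SeTok.
    """
    # L2-normalize so that dot products are cosine similarities.
    feats = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    centroids, members = [], []
    for f in feats:
        if centroids:
            sims = np.array([c @ f for c in centroids])
            best = int(sims.argmax())
            if sims[best] >= sim_threshold:
                members[best].append(f)
                # Refresh the centroid as the renormalized mean of members.
                c = np.mean(members[best], axis=0)
                centroids[best] = c / np.linalg.norm(c)
                continue
        centroids.append(f)
        members.append([f])
    # Mean-pool each cluster into one semantic token.
    return np.stack([np.mean(m, axis=0) for m in members])
```

With this sketch, an image of ten patches drawn from two visually distinct regions would yield two tokens, while a more varied image would yield more; a real tokenizer would learn the grouping end-to-end rather than use a fixed threshold.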