Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. A crux of MLLMs lies in vision tokenization, which involves efficiently transforming input visual signals into the feature representations that are most beneficial for LLMs. However, existing vision tokenizers, which are essential for semantic alignment between vision and language, remain problematic: they aggressively fragment the visual input, corrupting its semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. As evidenced by our experimental results, the proposed MLLM (Setokim) equipped with SeTok demonstrates significantly superior performance across various tasks. The project page is at https://chocowu.github.io/SeTok-web/.
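To make the idea of a token count that depends on image complexity concrete, the following is a minimal sketch and not the SeTok algorithm itself, whose details are not given in this abstract. It assumes patch features arrive as a NumPy array of shape (num_patches, dim) and swaps in a simple greedy cosine-similarity clustering. The function name `dynamic_cluster_tokens` and the parameter `sim_threshold` are hypothetical, introduced only for illustration.

```python
import numpy as np


def dynamic_cluster_tokens(features, sim_threshold=0.8):
    """Group patch features into a variable number of cluster tokens.

    Illustrative greedy clustering (an assumption, not the paper's method):
    a patch joins the most similar existing cluster if the cosine similarity
    to that cluster's centroid reaches sim_threshold; otherwise it starts a
    new cluster. Returns (tokens, assignments).
    """
    features = np.asarray(features, dtype=float)
    eps = 1e-12  # guards against division by zero for all-zero features
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)

    sums = []     # running sum of normalized member features, one per cluster
    members = []  # patch indices belonging to each cluster
    assignments = np.empty(len(features), dtype=int)

    for i, f in enumerate(normed):
        if sums:
            centroids = np.stack(sums)
            centroids = centroids / (
                np.linalg.norm(centroids, axis=1, keepdims=True) + eps
            )
            sims = centroids @ f
            best = int(np.argmax(sims))
            if sims[best] >= sim_threshold:
                sums[best] = sums[best] + f
                members[best].append(i)
                assignments[i] = best
                continue
        # No sufficiently similar cluster exists: open a new one.
        sums.append(f.copy())
        members.append([i])
        assignments[i] = len(sums) - 1

    # Each token is the mean of the original features in its cluster.
    tokens = np.stack([features[idx].mean(axis=0) for idx in members])
    return tokens, assignments


if __name__ == "__main__":
    uniform_image = np.tile([1.0, 0.0, 0.0], (16, 1))  # one homogeneous region
    varied_image = np.repeat(np.eye(3), 4, axis=0)     # three distinct regions
    print(dynamic_cluster_tokens(uniform_image)[0].shape[0], "token(s)")
    print(dynamic_cluster_tokens(varied_image)[0].shape[0], "token(s)")
```

In this sketch the number of tokens is not fixed in advance: a homogeneous input collapses to a single token, while an input with several dissimilar regions yields one token per region. A learned tokenizer would replace the fixed threshold with trained components.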