We introduce MUSE-VL, a Unified Vision-Language Model through Semantic Discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) capture only low-level information, which makes them difficult to align with textual semantic features. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance; moreover, their performance still lags far behind that of dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information in visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces training difficulty and improves the performance of the unified model. The proposed model significantly surpasses the previous state of the art on various vision-language benchmarks and outperforms dedicated understanding models.
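To make the core idea concrete, the following is a minimal sketch of how a semantic constraint might be added to a standard vector-quantized tokenizer's training loss. All names, dimensions, and the specific cosine-alignment term are illustrative assumptions, not the paper's actual formulation; `semantic_targets` stands in for features from a frozen semantic (e.g., text-aligned) encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
num_codes, dim = 16, 8
codebook = rng.normal(size=(num_codes, dim))

def quantize(features, codebook):
    """Nearest-neighbor vector quantization: map each feature to its closest code."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

def sde_style_loss(features, codebook, semantic_targets, alpha=1.0):
    """Standard VQ commitment term plus an assumed semantic alignment term.

    The alignment term pulls quantized visual tokens toward the semantic
    feature space, which is one plausible way to realize the "semantic
    constraint" the abstract describes.
    """
    _, quantized = quantize(features, codebook)
    commit = ((features - quantized) ** 2).mean()
    # Cosine-distance alignment between quantized tokens and semantic targets.
    q = quantized / np.linalg.norm(quantized, axis=1, keepdims=True)
    s = semantic_targets / np.linalg.norm(semantic_targets, axis=1, keepdims=True)
    align = (1.0 - (q * s).sum(axis=1)).mean()
    return commit + alpha * align

features = rng.normal(size=(4, dim))          # stand-in visual encoder features
semantic_targets = rng.normal(size=(4, dim))  # stand-in frozen semantic features
loss = sde_style_loss(features, codebook, semantic_targets)
print(loss)
```

With `alpha = 0` this reduces to a plain VQ commitment loss; the extra term is what ties the discrete visual codes to the semantic space shared with language tokens.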