For multimodal LLMs (MLLMs), achieving synergy between visual comprehension (textual output) and visual generation (visual output) remains an ongoing challenge. The cause is a pair of conflicting objectives: for comprehension, an MLLM needs to abstract the visuals; for generation, it needs to preserve the visuals as much as possible. This leaves visual tokens in a dilemma between the two objectives. To resolve the conflict, we propose encoding images into morph-tokens that serve a dual purpose: for comprehension, they act as visual prompts instructing the MLLM to generate text; for generation, they take on a different, non-conflicting role as complete visual tokens for image reconstruction, where the missing visual cues are recovered by the MLLM. Extensive experiments show that morph-tokens achieve a new SOTA for multimodal comprehension and generation simultaneously. Our project is available at https://github.com/DCDmllm/MorphTokens.