In this work, we introduce the Semantic Pyramid AutoEncoder (SPAE), which enables frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (words) drawn from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating visual content into a language the LLM can comprehend and empowering it to perform a wide array of multimodal tasks. We validate our approach through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content, while surpassing state-of-the-art performance on image understanding tasks by over 25% under the same setting.