In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
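To make the idea of converting visual features into words from an LLM's vocabulary more concrete, here is a minimal sketch. It assumes a nearest-neighbor lookup against a frozen embedding table, which the abstract does not specify, and it omits the pyramid structure entirely. The function names, the four-word vocabulary, and the toy embeddings are all hypothetical and chosen only for illustration.

```python
import numpy as np

def quantize_to_vocab(features, vocab_embeddings):
    """Return, for each feature vector, the index of the nearest vocabulary embedding.

    Assumption: nearest neighbor under squared Euclidean distance.
    """
    # ||f - e||^2 = ||f||^2 - 2 f.e + ||e||^2, computed for all pairs at once.
    f2 = (features ** 2).sum(axis=1, keepdims=True)   # shape (n, 1)
    e2 = (vocab_embeddings ** 2).sum(axis=1)          # shape (v,)
    dists = f2 - 2.0 * features @ vocab_embeddings.T + e2
    return dists.argmin(axis=1)

def tokens_to_words(token_ids, vocab):
    """Turn token indices into readable words from the vocabulary."""
    return [vocab[int(i)] for i in token_ids]

def dequantize(token_ids, vocab_embeddings):
    """Look up the frozen embeddings for a sequence of token indices."""
    return vocab_embeddings[token_ids]

# Toy data (hypothetical): a 4-word vocabulary with one-hot embeddings.
vocab = ["cat", "dog", "grass", "sky"]
vocab_embeddings = np.eye(4)

# Two "visual feature" vectors, as an encoder might produce them.
features = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.8],
])

token_ids = quantize_to_vocab(features, vocab_embeddings)
words = tokens_to_words(token_ids, vocab)
print(words)  # ['cat', 'sky']
```

In this sketch the vocabulary embeddings are never updated, which mirrors the frozen-LLM setting: the word list produced by `tokens_to_words` could be placed in a text prompt, and `dequantize` recovers the vectors a decoder would consume for reconstruction.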