In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
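As a rough illustration of what "converting between raw pixels and lexical tokens" can mean in practice, the sketch below maps feature vectors to the nearest entry of a fixed word-embedding table and back. It is a generic nearest-neighbor quantization toy, not the SPAE architecture: the four-word vocabulary, the 2-D embeddings, the feature values, and the function names `quantize_to_vocab` and `dequantize` are all assumptions made for this example, and the pyramid structure and training objectives are not modeled here.

```python
# Toy sketch: snap continuous feature vectors onto a frozen word-embedding table.
# Everything below (vocabulary, embeddings, features) is hypothetical example data.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize_to_vocab(features, vocab_embeddings):
    """Return, for each feature vector, the index of its nearest vocabulary embedding."""
    token_ids = []
    for feature in features:
        nearest = min(
            range(len(vocab_embeddings)),
            key=lambda j: squared_distance(feature, vocab_embeddings[j]),
        )
        token_ids.append(nearest)
    return token_ids

def dequantize(token_ids, vocab_embeddings):
    """Look up the embedding for each token id (the input a decoder would consume)."""
    return [vocab_embeddings[i] for i in token_ids]

# A stand-in for the frozen LLM vocabulary and its embedding table.
vocab = ["cat", "dog", "sky", "grass"]
vocab_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

# A stand-in for encoder outputs, one vector per image region.
features = [[0.9, 0.1], [-0.8, 0.2], [0.1, -0.7]]

token_ids = quantize_to_vocab(features, vocab_embeddings)
words = [vocab[i] for i in token_ids]
reconstructed = dequantize(token_ids, vocab_embeddings)
print(words)
```

Because the codebook is the language model's own vocabulary, the resulting token sequence can in principle be placed directly into a text prompt, which is what makes in-context learning with a frozen model possible without updating its weights.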