We introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular shallow alignment method, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision and language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Code and checkpoints are available at https://github.com/THUDM/CogVLM.
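The abstract does not spell out how the visual expert module works, so the following is only a minimal sketch under stated assumptions, not the released implementation. It assumes that image tokens are routed through their own trainable projection weights while text tokens keep using the frozen language-model weights, and that both kinds of token then share one causal attention. The names `routed_linear`, `visual_expert_attention`, `text_w`, and `image_w` are hypothetical, and the single-head NumPy setup is chosen for brevity.

```python
import numpy as np


def routed_linear(x, is_image, w_text, w_image):
    """Project each token with the weights that match its modality.

    x: (seq_len, d_model) hidden states.
    is_image: (seq_len,) boolean mask, True for image tokens.
    Both projections are computed for every token and then selected,
    which is wasteful but keeps the sketch short.
    """
    return np.where(is_image[:, None], x @ w_image, x @ w_text)


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def visual_expert_attention(x, is_image, text_w, image_w):
    """Single-head causal attention with modality-specific Q/K/V (assumed design).

    text_w and image_w are dicts with keys "q", "k", "v", each holding a
    (d_model, d_model) matrix. In this sketch text_w stands for the frozen
    language-model weights and image_w for the trainable visual expert.
    """
    q = routed_linear(x, is_image, text_w["q"], image_w["q"])
    k = routed_linear(x, is_image, text_w["k"], image_w["k"])
    v = routed_linear(x, is_image, text_w["v"], image_w["v"])

    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Causal mask: each token attends to itself and to earlier tokens only.
    causal = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    return softmax(scores) @ v
```

Under these assumptions, a sequence with no image tokens never touches `image_w`, so its output is identical to that of the original language model. This is one way to read the claim that NLP performance is not sacrificed. A feed-forward layer could be routed in the same way by calling `routed_linear` for each of its projections.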