Visual representation learning has been a cornerstone of computer vision, evolving from supervised learning with human-annotated labels to aligning image-text pairs from the Internet. Despite recent advances in multi-modal large language models (MLLMs), the visual representations they rely on, such as CLIP embeddings, often lack access to the external world knowledge that is critical for real-world visual reasoning. In this work, we propose Visual Table, a novel visual representation tailored for MLLMs. It provides hierarchical text descriptions of holistic visual scenes, consisting of a scene description and multiple object-centric descriptions that cover categories, attributes, and knowledge at the instance level. We further develop a scalable generator for visual table generation and train it on small-scale annotations from GPT4V. Extensive evaluations demonstrate that, with generated visual tables as additional visual representations, our model consistently outperforms state-of-the-art (SOTA) MLLMs across diverse benchmarks. When visual tables serve as standalone visual representations, our model closely matches or even beats SOTA MLLMs that are built on CLIP visual embeddings. Our code is available at https://github.com/LaVi-Lab/Visual-Table.
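To make the hierarchy described in the abstract concrete, the following is a minimal sketch of how a visual table could be held in memory and flattened to text for an MLLM. It is an illustration under stated assumptions, not the authors' implementation: the class names `ObjectEntry` and `VisualTable`, the `to_text` method, the text layout, and the example content are all hypothetical. Only the overall shape (one scene description plus object-centric entries with category, attributes, and knowledge) comes from the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObjectEntry:
    """One object-centric entry for a single instance (hypothetical name)."""
    category: str
    attributes: str
    knowledge: str


@dataclass
class VisualTable:
    """A scene description plus object-centric entries (hypothetical name)."""
    scene_description: str
    objects: List[ObjectEntry] = field(default_factory=list)

    def to_text(self) -> str:
        """Flatten the table to plain text; this layout is an assumption."""
        lines = [f"Scene: {self.scene_description}"]
        for i, obj in enumerate(self.objects, start=1):
            lines.append(f"Object {i}: {obj.category}")
            lines.append(f"  Attributes: {obj.attributes}")
            lines.append(f"  Knowledge: {obj.knowledge}")
        return "\n".join(lines)


# Made-up example content, for illustration only.
table = VisualTable(
    scene_description="A kitchen counter with breakfast items.",
    objects=[
        ObjectEntry("toaster", "silver, two slots",
                    "Heats bread slices with electric coils."),
        ObjectEntry("mug", "white, ceramic",
                    "Commonly used for hot drinks such as coffee."),
    ],
)
print(table.to_text())
```

Under this sketch, the flattened text could be passed to a language model either alongside image embeddings or on its own, mirroring the two settings the abstract evaluates.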