Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models

Visual representation learning has been a cornerstone of computer vision, evolving from supervised learning with human-annotated labels to aligning image-text pairs from the Internet. Despite recent advancements in multi-modal large language models (MLLMs), the visual representations they rely on, such as CLIP embeddings, often lack access to external world knowledge critical for real-world visual reasoning. In this work, we propose Visual Table, a novel visual representation tailored for MLLMs. It provides hierarchical text descriptions of holistic visual scenes, consisting of a scene description and multiple object-centric descriptions that encompass categories, attributes, and knowledge at the instance level. We further develop a scalable generator for visual table generation and train it on small-scale annotations from GPT4V. Extensive evaluations demonstrate that, with generated visual tables as additional visual representations, our model consistently outperforms state-of-the-art (SOTA) MLLMs across diverse benchmarks. When visual tables serve as standalone visual representations, our model closely matches or even beats SOTA MLLMs that are built on CLIP visual embeddings. Our code is available at https://github.com/LaVi-Lab/Visual-Table.
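
To make the hierarchy described above concrete, here is a minimal Python sketch of how a visual table could be held in memory and flattened into text. It is only an illustration under stated assumptions: the class names `VisualTable` and `ObjectEntry`, the field names, the `to_text` layout, and the kitchen example are all hypothetical and are not taken from the authors' released code or data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObjectEntry:
    """One object-centric description (field names are illustrative assumptions)."""
    category: str    # what the instance is
    attributes: str  # visible properties of the instance
    knowledge: str   # instance-level world knowledge, as free text


@dataclass
class VisualTable:
    """A scene-level description plus a list of instance-level object entries."""
    scene_description: str
    objects: List[ObjectEntry] = field(default_factory=list)

    def to_text(self) -> str:
        """Flatten the table into plain text (hypothetical layout)."""
        lines = [f"Scene: {self.scene_description}"]
        for i, obj in enumerate(self.objects, start=1):
            lines.append(
                f"Object {i}: category={obj.category}; "
                f"attributes={obj.attributes}; knowledge={obj.knowledge}"
            )
        return "\n".join(lines)


# Invented example content, purely for illustration.
example = VisualTable(
    scene_description="A kitchen counter with breakfast items.",
    objects=[
        ObjectEntry("toaster", "silver, two-slot", "Used to brown sliced bread with heat."),
        ObjectEntry("mug", "white, ceramic", "Commonly holds hot drinks such as coffee."),
    ],
)
print(example.to_text())
```

Because the representation is plain text, a flattened table like this could in principle be supplied to a language model either alongside image embeddings or on its own, which mirrors the two settings the abstract evaluates.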

