Visual representation learning has been a cornerstone of computer vision, with typical forms including visual embeddings, structural symbols, and text-based representations. Despite their success, CLIP-type visual embeddings often lack access to the world knowledge that is critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for visual reasoning. A visual table is a hierarchical description of a visual scene, consisting of a scene description and multiple object-centric descriptions that cover categories, attributes, and knowledge. Thanks to their structured, textual format, visual tables offer unique advantages over visual embeddings alone, such as interpretability and controllable editing. Furthermore, they deliver the instance-level world knowledge and detailed attributes that are essential for visual reasoning. To create visual tables, we develop a generator trained on a dataset of collected, small-scale annotations. Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural and text-based representations. Moreover, they consistently enhance state-of-the-art multimodal large language models across diverse benchmarks, showcasing their potential for advancing visual reasoning tasks. Our code is available at https://github.com/LaVi-Lab/Visual-Table.
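To make the hierarchy described above concrete, the following is a minimal sketch of how a visual table could be held as a data structure: one scene description plus a list of object-centric entries, each with a category, attributes, and knowledge. The class names, field names, and example values are hypothetical and chosen only for illustration; the format used in the released code may differ.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ObjectEntry:
    """One object-centric description (field names are illustrative assumptions)."""
    category: str    # what the object is
    attributes: str  # detailed visual attributes
    knowledge: str   # instance-level world knowledge


@dataclass
class VisualTable:
    """Hierarchical description: a scene description plus object entries."""
    scene_description: str
    objects: List[ObjectEntry] = field(default_factory=list)

    def to_text(self) -> str:
        # Serialize to JSON text so the table can be read by people
        # or placed in a language model prompt.
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Hypothetical example content, not taken from the paper's dataset.
table = VisualTable(
    scene_description="A kitchen counter with breakfast items.",
    objects=[
        ObjectEntry(
            category="toaster",
            attributes="silver, two slots, plugged in",
            knowledge="Heats bread slices; the metal surface gets hot during use.",
        ),
    ],
)

# Controllable editing: because the table is structured text,
# a single field can be changed directly.
table.objects[0].attributes = "silver, two slots, unplugged"
print(table.to_text())
```

Because the representation is plain structured text, it can be inspected or edited entry by entry, which is the kind of interpretability and controllable editing the abstract attributes to visual tables.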