While large multimodal models excel on broad vision-language benchmarks, they often struggle with tasks requiring precise perception of low-level visual details, such as comparing line lengths or solving simple mazes. This failure mode is especially persistent in question-answering tasks about vector graphics -- images composed purely of 2D objects and shapes. To address this challenge, we propose the Visually Descriptive Language Model (VDLM), which performs text-based reasoning about vector graphics. VDLM leverages Scalable Vector Graphics (SVG) as a more precise visual description, first encoding the input image with an off-the-shelf raster-to-SVG algorithm. Since existing language models cannot understand raw SVG in a zero-shot setting, VDLM then bridges SVG and pretrained language models through a newly introduced intermediate symbolic representation, Primal Visual Description (PVD), which comprises primitive attributes (e.g., shape, position, measurement) together with their predicted values. PVD is task-agnostic and represents visual primitives that are universal across vector graphics. It can be learned from procedurally generated (SVG, PVD) pairs and enables the direct use of LLMs for generalization to complex reasoning tasks. By casting an image into a text-based representation, we can leverage the power of language models to learn the alignment from SVG to visual primitives and to generalize to unseen question-answering tasks. Empirical results show that VDLM achieves stronger zero-shot performance than state-of-the-art LMMs, such as GPT-4V, on various low-level multimodal perception and reasoning tasks over vector graphics. We additionally present extensive analyses of VDLM, demonstrating that its disentangled perception and reasoning processes offer better interpretability. Project page: https://mikewangwzhl.github.io/VDLM/
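To make the image → SVG → PVD → LLM pipeline concrete, below is a minimal Python sketch of the three stages the abstract describes. All function names and the PVD schema here are illustrative assumptions for exposition, not the paper's actual interface or data format:

```python
# Hypothetical sketch of the VDLM pipeline (stage names and the PVD
# schema are illustrative assumptions, not the paper's actual API).

def rasterize_to_svg(image_path: str) -> str:
    """Stage 1: convert a raster image to SVG with an off-the-shelf
    raster-to-SVG tracing algorithm (stubbed out here)."""
    # Placeholder output; a real system would invoke an SVG tracer.
    return '<svg><circle cx="50" cy="50" r="20"/></svg>'

def svg_to_pvd(svg: str) -> list[dict]:
    """Stage 2: map raw SVG to a Primal Visual Description (PVD):
    primitive attributes (shape, position, measurement) with values.
    The paper learns this mapping from procedurally generated
    (SVG, PVD) pairs; here we return a toy, hand-written example."""
    return [{"shape": "circle", "center": (50, 50), "radius": 20}]

def reason_with_llm(pvd: list[dict], question: str) -> str:
    """Stage 3: hand the text-based PVD plus the question to an LLM
    for zero-shot reasoning (here we only build the prompt)."""
    prompt = f"Scene primitives: {pvd}\nQuestion: {question}"
    return prompt  # a real system would query an LLM with this prompt

svg = rasterize_to_svg("maze.png")
pvd = svg_to_pvd(svg)
answer_prompt = reason_with_llm(pvd, "How many circles are in the image?")
```

Because perception (stages 1-2) and reasoning (stage 3) are disentangled, the intermediate SVG and PVD can be inspected directly, which is the source of the interpretability advantage the abstract claims.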