Large language models (LLMs) have recently made significant advances in natural language understanding and generation, but their potential in computer vision remains largely unexplored. In this paper, we introduce a new, exploratory approach that enables LLMs to process images in the Scalable Vector Graphics (SVG) format. By using the XML-based textual descriptions of SVG representations instead of raster images, we aim to bridge the gap between the visual and textual modalities, allowing LLMs to understand and manipulate images directly, without parameterized visual components. Our method supports simple image classification, generation, and in-context learning using only LLM capabilities. We demonstrate the promise of our approach on discriminative and generative tasks, highlighting (i) its robustness to distribution shift, (ii) the substantial improvements gained by tapping into the in-context learning abilities of LLMs, and (iii) its image understanding and generation capabilities under human guidance. Our code, data, and models are available at https://github.com/mu-cai/svg-llm.
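To make the core idea concrete, the following is a minimal sketch of how an image expressed as SVG text could be placed into a few-shot prompt for in-context classification. It is an illustration only, not the authors' pipeline: the helper names (`make_svg`, `shape_tags`, `build_prompt`), the prompt wording, and the hand-written shapes are all assumptions, and no LLM is actually called. Hand-written SVG stands in for whatever raster-to-SVG conversion a real pipeline would use.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def make_svg(shape_xml, size=28):
    """Wrap one shape element in a minimal SVG document (hypothetical helper)."""
    return (
        f'<svg xmlns="{SVG_NS}" width="{size}" height="{size}" '
        f'viewBox="0 0 {size} {size}">{shape_xml}</svg>'
    )


def shape_tags(svg_text):
    """Parse the SVG as XML and return the local names of its top-level shapes.

    Parsing also acts as a well-formedness check: malformed XML raises ParseError.
    """
    root = ET.fromstring(svg_text)
    # ElementTree reports namespaced tags as "{namespace}name"; keep only "name".
    return [child.tag.split("}")[-1] for child in root]


def build_prompt(examples, query_svg):
    """Assemble a few-shot text prompt from (svg, label) pairs plus one query.

    The prompt format is an assumption made for this sketch.
    """
    parts = ["Classify the shape drawn by each SVG."]
    for svg, label in examples:
        parts.append(f"SVG: {svg}\nLabel: {label}")
    parts.append(f"SVG: {query_svg}\nLabel:")
    return "\n\n".join(parts)


# Toy "images": the image content is entirely plain text.
circle = make_svg('<circle cx="14" cy="14" r="10" fill="black"/>')
square = make_svg('<rect x="4" y="4" width="20" height="20" fill="black"/>')
query = make_svg('<circle cx="10" cy="12" r="6" fill="black"/>')

prompt = build_prompt([(circle, "circle"), (square, "square")], query)
print(prompt)
```

The resulting `prompt` string would be sent to a text-only LLM, which is asked to complete the final `Label:` line. Because the image is just XML text, the same string can also be edited or generated by the model, which is the property the abstract relies on.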