Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI): their impressive zero-shot capacity for user-tailored tasks gives them immense potential across a range of applications. In computer vision, however, the numerous powerful vision foundation models (VFMs) now available remain restricted to tasks in a pre-defined form and struggle to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. The framework provides a unified perspective on vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed through language instructions. An LLM-based decoder then makes appropriate predictions for open-ended tasks based on these instructions. Extensive experiments show that VisionLLM achieves different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. Notably, with a generalist LLM-based framework, our model achieves over 60\% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo will be released based on https://github.com/OpenGVLab/InternGPT. The code will be released at https://github.com/OpenGVLab/VisionLLM.