GiT: Towards Generalist Vision Transformer through Universal Language Interface

This paper proposes GiT, a simple yet effective framework that handles a variety of vision tasks with only a vanilla ViT. Motivated by the universality of the multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope to serve as a powerful vision foundation model (VFM). However, unlike language modeling, visual tasks typically require task-specific modules, such as bounding-box heads for detection and pixel decoders for segmentation, which greatly hinders the application of powerful multi-layer Transformers in the vision domain. To solve this, we design a universal language interface that enables auto-regressive decoding to unify various visual tasks, ranging from image-level understanding (e.g., captioning), through sparse perception (e.g., detection), to dense prediction (e.g., segmentation). With this design, the entire model consists solely of a ViT, without any task-specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, GiT sets a new benchmark in generalist performance and fosters mutual enhancement across tasks, leading to significant improvements over isolated training. This mirrors a similar effect observed in LLMs. Further enriching training with 27 datasets, GiT achieves strong zero-shot results across various tasks. Owing to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models will be available at \url{https://github.com/Haiyang-W/GiT}.
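To make the idea of a language interface for a sparse perception task more concrete, the sketch below shows one way a bounding box could be written as a short token sequence that an auto-regressive decoder might emit. The abstract does not specify GiT's tokenization, so everything here is an assumption for illustration: the coordinate-binning scheme, the `<bin_k>` token format, the bin count, and the function names `box_to_tokens` and `tokens_to_box` are hypothetical and not taken from the GiT code base.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

NUM_BINS = 256  # hypothetical vocabulary size for coordinate tokens


def box_to_tokens(box: Box, label: str, image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> List[str]:
    """Serialize one box as four coordinate tokens followed by a class token."""
    width, height = image_size

    def quantize(value: float, extent: int) -> int:
        # Normalize to [0, 1], then map to an integer bin index.
        normalized = min(max(value / extent, 0.0), 1.0)
        return min(int(normalized * num_bins), num_bins - 1)

    x0, y0, x1, y1 = box
    bins = [quantize(x0, width), quantize(y0, height),
            quantize(x1, width), quantize(y1, height)]
    return [f"<bin_{b}>" for b in bins] + [label]


def tokens_to_box(tokens: List[str], image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> Tuple[Box, str]:
    """Invert box_to_tokens, returning bin centers in pixel coordinates."""
    width, height = image_size

    def dequantize(token: str, extent: int) -> float:
        index = int(token[len("<bin_"):-1])
        return (index + 0.5) / num_bins * extent

    x0 = dequantize(tokens[0], width)
    y0 = dequantize(tokens[1], height)
    x1 = dequantize(tokens[2], width)
    y1 = dequantize(tokens[3], height)
    return (x0, y0, x1, y1), tokens[4]


if __name__ == "__main__":
    tokens = box_to_tokens((64, 128, 256, 384), "person", (512, 512))
    print(tokens)
    print(tokens_to_box(tokens, (512, 512)))
```

Under this assumed scheme, detection output becomes ordinary text-like tokens, so no separate bounding-box head is needed; the cost is a quantization error of at most half a bin width per coordinate.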
