How well do text-only large language models (LLMs) align with the visual world? We present a systematic evaluation of this question by incorporating frozen representations of various language models into a discriminative vision-language framework and measuring zero-shot generalization to novel concepts. We find that decoder-based models exhibit stronger visual alignment than encoders, even when controlling for model and dataset size. Moreover, language modeling performance correlates with visual generalization, suggesting that advances in unimodal LLMs can simultaneously improve vision models. Leveraging these insights, we propose ShareLock, a lightweight method for fusing frozen vision and language backbones. ShareLock achieves robust performance across tasks while drastically reducing the need for paired data and compute. With just 563k image-caption pairs and under one GPU-hour of training, it reaches 51% accuracy on ImageNet. In cross-lingual settings, ShareLock dramatically outperforms CLIP, achieving 38.7% top-1 accuracy on Chinese image classification versus CLIP's 1.4%. Code is available.
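The abstract does not spell out the fusion mechanism, but a ShareLock-style setup with frozen backbones can be sketched as follows: features from both encoders are precomputed once, and only a small projection head mapping frozen text features into the frozen vision feature space is trained with a symmetric contrastive loss. All names, dimensions, and architectural choices below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Hypothetical lightweight head: both backbones stay frozen, so we train
    only a small MLP that projects frozen LLM text features into the frozen
    vision feature space (dimensions here are illustrative)."""

    def __init__(self, text_dim: int, vision_dim: int, hidden: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, vision_dim),
        )
        # Learnable temperature, initialized near log(1/0.07) as in CLIP
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, text_feats: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        t = F.normalize(self.proj(text_feats), dim=-1)
        v = F.normalize(vision_feats, dim=-1)
        logits = self.logit_scale.exp() * v @ t.T  # image-to-text similarities
        targets = torch.arange(logits.size(0))
        # Symmetric InfoNCE loss over the in-batch positive pairs
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random stand-ins for precomputed frozen features
head = FusionHead(text_dim=4096, vision_dim=768)
text_feats = torch.randn(8, 4096)   # e.g. frozen LLM sentence embeddings
vision_feats = torch.randn(8, 768)  # e.g. frozen vision encoder embeddings
loss = head(text_feats, vision_feats)
```

Because the backbones never receive gradients, features for the paired dataset can be cached once, which is what makes training on 563k pairs feasible in under a GPU-hour.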