This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world and are unaware of even well-established biological taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual recognition (e.g., a model recognizes an image as Anemone Fish but not as Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM on our VQA tasks reaffirms the LLMs' bottleneck effect, because the tasks improve the hierarchical consistency of the LLMs more than that of the vision LLMs. We conjecture that one cannot make open-source vision LLMs understand visual concepts hierarchically until LLMs possess the corresponding taxonomy knowledge.
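To make the four-choice setup concrete, the following is a minimal sketch of how one such question might be built from a taxonomy and how hierarchical consistency could be scored. It is an illustration under stated assumptions, not the construction used in the paper: the toy taxonomy, the function names `make_question` and `hierarchical_consistency`, the rule of sampling distractors from labels at the same level, and the all-levels-correct definition of consistency are all hypothetical.

```python
import random

# Hypothetical toy taxonomy: each leaf maps to its path from coarse to fine.
TAXONOMY_PATHS = {
    "Anemone Fish": ["Vertebrate", "Fish", "Anemone Fish"],
    "Bald Eagle": ["Vertebrate", "Bird", "Bald Eagle"],
    "Monarch Butterfly": ["Arthropod", "Insect", "Monarch Butterfly"],
    "Garden Snail": ["Mollusk", "Gastropod", "Garden Snail"],
    "Moon Jellyfish": ["Cnidarian", "Jellyfish", "Moon Jellyfish"],
}


def labels_at_level(level):
    """Return all distinct labels that occur at one taxonomy level."""
    return sorted({path[level] for path in TAXONOMY_PATHS.values()})


def make_question(leaf, level, rng):
    """Build one four-choice question for an image of `leaf` at `level`.

    Assumption: the three distractors are sampled from other labels
    at the same level of the taxonomy.
    """
    answer = TAXONOMY_PATHS[leaf][level]
    distractors = [lab for lab in labels_at_level(level) if lab != answer]
    choices = rng.sample(distractors, 3) + [answer]
    rng.shuffle(choices)
    return {
        "question": "Which category best describes the object in the image?",
        "choices": choices,
        "answer": answer,
    }


def hierarchical_consistency(results):
    """Fraction of images answered correctly at every taxonomy level.

    `results` is a list of (leaf, predicted_path) pairs. Assumption: an
    image counts as consistent only if all levels are predicted correctly.
    """
    consistent = sum(pred == TAXONOMY_PATHS[leaf] for leaf, pred in results)
    return consistent / len(results)
```

As a usage example, a model that predicts `["Arthropod", "Bird", "Bald Eagle"]` for a Bald Eagle image gets the fine-grained label right but fails the coarse level, so under this sketch the image counts as inconsistent.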