With the introduction of transformer-based models for vision and language tasks, such as LLaVA and Chameleon, there has been renewed interest in the discrete tokenized representation of images. These models often treat image patches as discrete tokens, analogous to words in natural language, learning joint alignments between visual and human languages. However, little is known about the statistical behavior of these visual languages: whether they follow frequency distributions, grammatical structures, or topologies similar to those of natural languages. In this paper, we take a natural-language-centric approach to analyzing discrete visual languages and uncover striking similarities and fundamental differences. We demonstrate that, although visual languages adhere to Zipfian distributions, higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts, indicating intermediate granularity. We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages. Finally, we demonstrate that, while vision models align more closely with natural languages than other models do, this alignment remains significantly weaker than the cohesion found within natural languages. Through these experiments, we show how understanding the statistical properties of discrete visual languages can inform the design of more effective computer vision models.