Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance on tasks such as image recognition and object localization. However, their effectiveness on fine-grained tasks remains an open question. In everyday scenarios, people who encounter design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify the aesthetically pleasing fonts used in the text. Because VLMs are multimodal and many are freely accessible, they are often considered potential tools for font recognition. This raises a fundamental question: do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a Stroop effect (the text names one font while being rendered in another) that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.
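The two FRB versions described above can be pictured as two sample grids. The following is a minimal sketch, not the authors' construction code: the font names and sentences are hypothetical placeholders (the abstract lists neither), and the hard version is assumed to be the full cross product of font names and rendering fonts, which the abstract does not confirm. The `accuracy` helper and the read-the-text baseline are likewise illustrative.

```python
from itertools import product

# Hypothetical placeholders: the abstract does not list the fonts or sentences.
FONTS = [f"Font{i:02d}" for i in range(1, 16)]               # 15 fonts
SENTENCES = [f"Sample sentence {i}." for i in range(1, 11)]  # 10 sentences


def build_easy(fonts, sentences):
    """Every sentence paired with every rendering font; text gives no font cue."""
    return [{"text": s, "font": f} for s, f in product(sentences, fonts)]


def build_hard(fonts):
    """Every font name paired with every rendering font (assumed cross product).

    'conflict' marks Stroop-style samples where the text names a different
    font than the one used to render it.
    """
    return [
        {"text": name, "font": f, "conflict": name != f}
        for name, f in product(fonts, fonts)
    ]


def accuracy(samples, predictions):
    """Fraction of samples whose predicted font equals the rendering font."""
    correct = sum(p == s["font"] for s, p in zip(samples, predictions))
    return correct / len(samples)


easy = build_easy(FONTS, SENTENCES)
hard = build_hard(FONTS)

# A baseline that simply reads the text back as its answer is right only
# on the samples where the text happens to name the rendering font.
read_text_baseline = [s["text"] for s in hard]
read_text_accuracy = accuracy(hard, read_text_baseline)
```

Under these assumptions, a model that answers from the semantics of the text rather than the glyph shapes would score near the read-the-text baseline on the hard grid, which is the failure mode the Stroop-style design is meant to expose.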

