Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding of textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such inputs comparably. We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, multimodal reasoning, and unimodal understanding. It evaluates visualized-text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 20 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when semantically equivalent content is presented as visualized text. This gap widens further as perceptual difficulty increases, highlighting models' sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework for diagnosing this limitation and guiding progress toward more unified language representations across tokenized text and pixels. The source dataset is available at https://github.com/QingAnLiu/VISTA-Bench.