Current multilingual evaluations of Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap: models frequently solve a visual task in one script while failing the identical task in another, with accuracy differences between scripts reaching 16%. Crucially, visual input raises absolute performance uniformly across scripts yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, which points to script-locked knowledge representations. Our findings, supported by McNemar tests across all script pairs, demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.
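The two quantities named in the abstract, the Script Consistency Rate and the pairwise McNemar test, can be illustrated with a small sketch. The abstract does not define SCR, so this sketch assumes it is the share of items a model answers correctly in all three scripts; the paper's actual definition may differ. The function names, the per-item dictionary layout, and the toy data are hypothetical and are not taken from the released code. The McNemar test shown is the standard exact (binomial) two-sided variant, which may not be the variant the authors used.

```python
from math import comb

# Script labels as named in the abstract; the lowercase keys are an assumption.
SCRIPTS = ("gurmukhi", "shahmukhi", "roman")


def script_consistency_rate(results):
    """Share of items answered correctly in every script.

    ASSUMPTION: this definition is illustrative only; the abstract does not
    state how SCR is computed.
    `results` is a list of dicts mapping script name -> bool (correct or not).
    """
    if not results:
        raise ValueError("results must not be empty")
    consistent = sum(all(item[s] for s in SCRIPTS) for item in results)
    return consistent / len(results)


def mcnemar_exact(results, script_a, script_b):
    """Exact two-sided McNemar test on paired correctness for two scripts.

    Returns (b, c, p_value), where b counts items correct only in script_a
    and c counts items correct only in script_b.
    """
    b = sum(item[script_a] and not item[script_b] for item in results)
    c = sum(item[script_b] and not item[script_a] for item in results)
    n = b + c
    if n == 0:
        # No discordant pairs: the data give no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data: four parallel items, one correctness flag per script.
toy_results = [
    {"gurmukhi": True, "shahmukhi": True, "roman": True},
    {"gurmukhi": True, "shahmukhi": False, "roman": True},
    {"gurmukhi": True, "shahmukhi": False, "roman": False},
    {"gurmukhi": False, "shahmukhi": False, "roman": False},
]

print(script_consistency_rate(toy_results))
print(mcnemar_exact(toy_results, "gurmukhi", "shahmukhi"))
```

Because the benchmark instances are strictly parallel across scripts, a paired test such as McNemar's is the natural choice: it looks only at items where the two scripts disagree, rather than comparing two independent accuracy figures.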