While Large Vision-Language Models (LVLMs) demonstrate promising multilingual capabilities, their evaluation is currently hindered by two critical limitations: (1) the use of non-parallel corpora, which conflates inherent language capability gaps with dataset artifacts, precluding a fair assessment of cross-lingual alignment; and (2) disjointed multimodal inputs, which deviate from real-world scenarios where most texts are embedded within visual contexts. To address these challenges, we propose PM4Bench, the first Multilingual Multi-Modal Multi-task Benchmark constructed on a strictly parallel corpus across 10 languages. By eliminating content divergence, our benchmark enables a fair comparison of model capabilities across different languages. We also introduce a vision setting in which textual queries are visually fused into images, compelling models to jointly "see," "read," and "think". Extensive evaluation of 10 LVLMs uncovers a substantial performance drop in the vision setting compared to standard inputs. Further analysis reveals that OCR capability is not only a general bottleneck but also contributes to cross-lingual performance disparities, suggesting that improving multilingual OCR is essential for advancing LVLM performance. We will release PM4Bench at https://github.com/opendatalab/PM4Bench.
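The abstract does not detail how queries are rendered into images for the vision setting. Below is a minimal illustrative sketch, assuming a simple Pillow-based pipeline; the function name `fuse_query_into_image`, the layout (query band appended below the image), and the font choice are our own assumptions for illustration, not the benchmark's actual construction code.

```python
from PIL import Image, ImageDraw, ImageFont

def fuse_query_into_image(image_path: str, query: str, font_path: str,
                          font_size: int = 28, margin: int = 20) -> Image.Image:
    """Hypothetical sketch: render a textual query into the image itself,
    so the model must read the question from pixels rather than from text input."""
    base = Image.open(image_path).convert("RGB")
    # The font must cover the target script (e.g., a Noto font for CJK or Arabic text).
    font = ImageFont.truetype(font_path, font_size)

    # Measure the rendered query to size the text band below the image.
    probe = ImageDraw.Draw(base)
    bbox = probe.textbbox((0, 0), query, font=font)
    text_height = bbox[3] - bbox[1]

    # New canvas: original image on top, query text band underneath.
    canvas = Image.new("RGB", (base.width, base.height + text_height + 2 * margin), "white")
    canvas.paste(base, (0, 0))
    draw = ImageDraw.Draw(canvas)
    draw.text((margin, base.height + margin), query, fill="black", font=font)
    return canvas

# Example (hypothetical paths): fuse a Chinese query into a sample image.
# fused = fuse_query_into_image("sample.jpg", "图中标志写了什么？", "NotoSansCJK-Regular.ttc")
```

This sketch ignores line wrapping and layout variation; a real benchmark pipeline would likely handle multi-line queries and script-specific typography.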