We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, The Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. The dataset aims to assess how well VLMs generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4V, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4V ranks best in this comparison, its linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally diverse evaluation paradigms.