We present a comprehensive three-phase study to examine (1) the cultural understanding of Large Multimodal Models (LMMs) by introducing DalleStreet, a large-scale dataset generated by DALL-E 3 and validated by humans, containing 9,935 images of 67 countries and 10 concept classes; (2) the underlying implicit and potentially stereotypical cultural associations with a cultural artifact extraction task; and (3) an approach to adapt cultural representation in an image based on extracted associations using a modular pipeline, CultureAdapt. We find disparities in cultural understanding at geographic sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V) models on DalleStreet and other existing benchmarks, which we investigate using the over 18,000 artifacts we identify in association with different countries. Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems. Dataset and code are available at https://github.com/iamshnoo/crossroads