In this work, we present a comprehensive three-phase study examining (1) the effectiveness of large multimodal models (LMMs) in recognizing cultural contexts; (2) the accuracy of their representations of diverse cultures; and (3) their ability to adapt content across cultural boundaries. We first introduce Dalle Street, a large-scale dataset generated by DALL-E 3 and validated by humans, containing 9,935 images spanning 67 countries and 10 concept classes. On Dalle Street and other existing benchmarks, we reveal disparities in cultural understanding at the sub-region level for both open-weight (LLaVA) and closed-source (GPT-4V) models. Next, we assess models' deeper cultural understanding through an artifact extraction task, identifying over 18,000 artifacts associated with different countries. Finally, we propose CultureAdapt, a highly composable pipeline for adapting images from one culture to another. Our findings reveal a nuanced picture of the cultural competence of LMMs and highlight the need to develop culture-aware systems. Dataset and code are available at https://github.com/iamshnoo/crossroads