In a highly globalized world, it is important for multimodal large language models (MLLMs) to recognize and respond correctly to mixed-cultural inputs. For example, a model should correctly identify kimchi (a Korean food) in an image both when an Asian woman is eating it and when an African man is eating it. However, current MLLMs over-rely on the visual features of the person depicted, leading to misclassification of the surrounding entities. To examine the robustness of MLLMs to different ethnicities, we introduce MixCuBe, a cross-cultural bias benchmark, and study cultural elements from five countries across four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbations for high-resource cultures, but not for low-resource cultures. GPT-4o, the best-performing model overall, shows up to a 58% difference in accuracy between the original and perturbed cultural settings in low-resource cultures. Our dataset is publicly available at: https://huggingface.co/datasets/kyawyethu/MixCuBe.