Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural association. We introduce the Cross-Cultural, Cross-Modal, Cross-Lingual Multimodal (3XCM) benchmark to isolate these effects. Our results indicate that, in image-to-text retrieval, models tend to favor entries in prevalent languages over semantically faithful ones. In text-to-image retrieval, we observe a consistent "tugging effect" in the joint embedding space between semantic alignment and language-conditioned cultural association. When semantic representations are insufficiently resolved, particularly for low-resource languages, similarity is increasingly governed by culturally familiar visual patterns, producing systematic association bias in retrieval. Our findings suggest that equitable multimodal retrieval requires targeted strategies that explicitly decouple language from culture, rather than broader data exposure alone. This work highlights the need to treat linguistic and cultural biases as distinct, measurable challenges in multimodal representation learning.
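To ground the mechanism these claims refer to, the following is a minimal sketch of cross-modal retrieval in a CLIP-style joint embedding space: both modalities are mapped to unit vectors, and retrieval ranks candidates by cosine similarity. The random `encode` stand-in, the embedding dimension, and the candidate count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in encoder: in a real system this would be the image or text
# tower of a CLIP-style model mapping inputs into one shared space.
def encode(n, dim=512):
    v = rng.normal(size=(n, dim))
    # L2-normalize so dot products equal cosine similarities.
    return v / np.linalg.norm(v, axis=1, keepdims=True)

image_emb = encode(1)       # one query image
candidate_emb = encode(5)   # candidate captions, e.g. across languages

# Retrieval reduces to ranking candidates by cosine similarity.
scores = candidate_emb @ image_emb.T        # shape (5, 1)
ranking = np.argsort(-scores.ravel())       # best match first
print("retrieved order:", ranking)
```

Under this scoring rule, the biases described above amount to the top-ranked candidate being a prevalent-language or culturally associated entry even when a semantically faithful one is among the candidates; a benchmark probe can measure this by comparing the rank assigned to the faithful caption against those of the distractors.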