Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias is the tendency, in image-to-text retrieval, to favor entries from prevalent languages over semantically faithful entries. Second, association bias is the tendency, in text-to-image retrieval, to favor images culturally associated with the query over semantically correct ones. Our results show that explicit alignment is a more effective strategy for mitigating prevalence bias, whereas association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling, and that bias arising from cultural association may be harder to address than bias arising from linguistic prevalence.