Evaluating Perspectival Biases in Cross-Modal Retrieval

Teerapol Saengsukhiran, Peerawat Chomphooyod, Narabodee Rodjananant, Chompakorn Chaksangchaichot, Patawee Prakrankamanant, Witthawin Sripheanpol, Pak Lovichit, Sarana Nutanong, Ekapol Chuangsuwanich

Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias is the tendency, in image-to-text retrieval, to favor entries from prevalent languages over semantically faithful entries. Second, association bias is the tendency, in text-to-image retrieval, to favor images culturally associated with the query over semantically correct ones. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling, and that bias arising from cultural association may be a more challenging problem than bias arising from linguistic prevalence.
