We present Multi-EuP, a new multilingual benchmark dataset comprising 22K multilingual documents collected from the European Parliament and spanning 24 languages. The dataset is designed to study fairness in multilingual information retrieval (IR), supporting analysis of both language bias and demographic bias in ranking. It offers an authentic multilingual corpus, with topics translated into all 24 languages and cross-lingual relevance judgments. It also provides rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR, and we conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.