Multilingual information retrieval (MLIR) considers the problem of ranking documents in several languages for a query expressed in a language that may differ from any of the document languages. Recent work has observed that approaches such as combining per-language ranked lists or using multilingual pretrained language models exhibit a preference for one language over others, resulting in systematic unfair treatment of documents in different languages. This work proposes a language fairness metric to evaluate whether documents across different languages are fairly ranked, using statistical equivalence testing based on the Kruskal-Wallis test. In contrast to most prior work in group fairness, we do not consider any language to be an unprotected group. Our proposed measure, PEER (Probability of Equal Expected Rank), is thus the first fairness metric specifically designed to capture the language fairness of MLIR systems. We demonstrate the behavior of PEER on artificial ranked lists. We also evaluate real MLIR systems on two publicly available benchmarks and show that PEER scores align with prior analytical findings on MLIR fairness. Our implementation is compatible with ir-measures and is available at http://github.com/hltcoe/peer_measure.
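The core idea can be illustrated with a small sketch. This is not the authors' PEER implementation, only a demonstration of the underlying Kruskal-Wallis machinery: given the rank positions that documents of each language occupy in a single MLIR ranked list, the H statistic measures whether expected ranks differ across languages (ranks in one retrieval list are unique, so no tie correction is needed).

```python
# Sketch only (not the authors' PEER code): a Kruskal-Wallis H
# statistic over the rank positions that each language's documents
# receive in one ranked list. A small H suggests languages are
# treated exchangeably; a large H signals a systematic preference.

def kruskal_wallis_h(groups):
    """H statistic over groups of rank positions (1-indexed, unique)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = (n_total + 1) / 2  # mean of ranks 1..N
    return 12 / (n_total * (n_total + 1)) * sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )

# Toy ranked lists of 9 documents in three languages, given as the
# rank positions each language's documents occupy.
fair = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]    # languages interleaved
unfair = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # one language always on top

print(kruskal_wallis_h(fair))    # ~0.8, below the chi2(df=2) 5% cutoff of 5.99
print(kruskal_wallis_h(unfair))  # ~7.2, above it: evidence of language bias
```

PEER turns this into an equivalence test: rather than rejecting equality, it asks for the probability that the expected ranks across languages are equal, so no language plays the role of an unprotected group.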