While reasoning rerankers such as Rank1 have demonstrated a strong ability to improve ranking relevance, it is unclear how they perform on other retrieval qualities such as fairness. We conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers. Using the TREC 2022 Fair Ranking Track dataset, we evaluate six reranking models across multiple retrieval settings and demographic attributes. Our findings show that reasoning neither improves nor harms fairness compared to non-reasoning approaches. Our fairness metric, Attention-Weighted Rank Fairness (AWRF), remains stable (0.33-0.35) across all models, even as relevance varies substantially (nDCG 0.247-1.000). A breakdown by demographic attribute reveals fairness gaps for geographic attributes regardless of model architecture. These results indicate that current implementations preserve the fairness characteristics of their input ranking, and that future work on specializing reasoning models to be aware of fairness attributes could lead to improvements.