Relevance and fairness are two major objectives of recommender systems (RSs). Recent work proposes measures of RS fairness that are either independent of relevance (fairness-only measures) or conditioned on relevance (joint measures). While fairness-only measures have been studied extensively, we look into whether joint measures can be trusted. We collect all joint evaluation measures of RS relevance and fairness, and ask: How much do they agree with each other? To what extent do they agree with relevance-only and fairness-only measures? How sensitive are they to changes in rank position, or to increasingly fair and relevant recommendations? We present the first empirical study of the behaviour of these measures, across 4 real-world datasets and 4 recommenders. We find that most of these measures: i) correlate weakly with one another and even contradict each other at times; ii) are less sensitive to rank position changes than relevance-only and fairness-only measures, meaning that they are less granular than traditional RS measures; and iii) tend to compress scores at the low end of their range, meaning that they are not very expressive. We respond to these limitations with a set of guidelines on the appropriate usage of such measures, i.e., they should be used with caution because they tend to contradict each other and have a very small empirical range.