Fairness is an emerging and challenging topic in recommender systems. In recent years, various ways of evaluating, and thereby improving, fairness have emerged. In this study, we examine existing evaluation measures of fairness in recommender systems. Specifically, we focus solely on exposure-based fairness measures of individual items, which aim to quantify the disparity in how individual items are recommended to users, independently of item relevance to users. We gather all such measures and critically analyse their theoretical properties. We identify a series of limitations in each of them, which collectively may render the affected measures hard or impossible to interpret, to compute, or to use for comparing recommendations. We resolve these limitations by redefining or correcting the affected measures, or we argue why certain limitations cannot be resolved. We further perform a comprehensive empirical analysis of both the original and our corrected versions of these fairness measures, using real-world and synthetic datasets. Our analysis provides novel insights into the relationship between measures based on different fairness concepts, and into the measures' different levels of sensitivity and strictness. We conclude with practical suggestions on which fairness measures should be used and when. Our code is publicly available. To our knowledge, this is the first critical comparison of individual item fairness measures in recommender systems.