One of the many fairness definitions pursued in recent recommender system research aims to mitigate the demographic information encoded in model representations. Models optimized for this definition are typically evaluated on how well demographic attributes can be classified from model representations, under the (implicit) assumption that this measure accurately reflects \textit{recommendation parity}, i.e., how similar the recommendations given to different users are. We challenge this assumption by comparing the amount of demographic information encoded in representations with various measures of how the recommendations differ. We propose two new approaches for measuring how well demographic information can be classified from ranked recommendations. Our results from extensive testing of multiple models on one real and multiple synthetically generated datasets indicate that optimizing for fair representations positively affects recommendation parity, but also that evaluation at the representation level is not a good proxy for measuring this effect when comparing models. We also provide extensive insight into how recommendation-level fairness metrics behave for various models by evaluating their performance on numerous generated datasets with different properties.