Numerous approaches have recently been proposed for learning fair representations that mitigate unfair outcomes in prediction tasks. A key motivation for these methods is that the representations can be used by third parties with unknown objectives. However, because current fair representations are generally not interpretable, a third party cannot use them for exploration or to obtain additional insights beyond the pre-contracted prediction tasks. Thus, to increase data utility beyond prediction tasks, we argue that representations need to be fair yet interpretable. We propose a general framework for learning interpretable fair representations by introducing interpretable "prior knowledge" during the representation learning process. We implement this idea and conduct experiments on the ColorMNIST and Dsprite datasets. The results indicate that, in addition to being interpretable, our representations attain slightly higher accuracy and fairer outcomes in a downstream classification task compared to state-of-the-art fair representations.