As machine learning models are increasingly deployed, numerous studies have sought to improve their fairness. However, research at the intersection of fairness and explainability remains scarce, which makes it harder to earn the trust of real users. Here, we propose a novel module that constructs a fair latent space, enabling faithful explanations while ensuring fairness. The fair latent space is constructed by disentangling and redistributing labels and sensitive attributes, which allows counterfactual explanations to be generated for each type of information. Our module attaches to a pretrained generative model and transforms its biased latent space into a fair one. Moreover, because only the module needs to be trained, our approach saves time and cost: the generative model itself does not need retraining. We validate the fair latent space with various fairness metrics and demonstrate that our approach can effectively explain biased decisions while providing assurances of fairness.
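The core idea of the module can be illustrated with a minimal sketch: an invertible map re-expresses the generator's latent code in coordinates where label information and the sensitive attribute occupy disjoint subspaces, so a counterfactual is produced by editing only the sensitive coordinates. All names, dimensions, and the use of a linear orthogonal map are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a disentangling module attached to a frozen generator.
# Assumption: a linear orthogonal map suffices to split the latent into a
# label subspace (first D_Y coords) and a sensitive subspace (last D_S coords).
import numpy as np

rng = np.random.default_rng(0)
D, D_Y, D_S = 8, 6, 2  # latent dim, label subspace dim, sensitive subspace dim

# The trainable module: an orthogonal (hence invertible) change of basis.
W = np.linalg.qr(rng.standard_normal((D, D)))[0]

def to_fair(z):
    """Map a biased latent z into the fair space; split into (label, sensitive)."""
    u = z @ W
    return u[..., :D_Y], u[..., D_Y:]

def from_fair(z_y, z_s):
    """Invert the map back to the generator's original latent space."""
    return np.concatenate([z_y, z_s], axis=-1) @ W.T

def counterfactual(z, z_s_target):
    """Swap only the sensitive coordinates, keeping label information fixed."""
    z_y, _ = to_fair(z)
    return from_fair(z_y, z_s_target)

z = rng.standard_normal(D)
z_cf = counterfactual(z, np.zeros(D_S))

# The label coordinates survive the edit; only the sensitive part changes.
z_y, _ = to_fair(z)
z_y_cf, z_s_cf = to_fair(z_cf)
print(np.allclose(z_y, z_y_cf), np.allclose(z_s_cf, 0.0))  # True True
```

Because only `W` would be trained in this sketch, the pretrained generator stays frozen, mirroring the time and cost advantage described above; the actual module in the paper need not be linear.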