Fairness is critical for artificial intelligence systems, especially those deployed in high-stakes applications such as hiring and the justice system. Existing efforts toward fairness in machine learning typically require retraining or fine-tuning the neural network weights to meet the fairness criteria. However, this is often infeasible in practice for ordinary model users, who cannot access or modify the model weights. In this paper, we propose a more flexible fairness paradigm, Inference-Time Rule Eraser (or simply Eraser), which considers the case where model weights cannot be accessed and tackles fairness issues from the perspective of removing biased rules at inference time. We first verify, through Bayesian analysis, the feasibility of modifying the model output to remove the biased rule, and derive Inference-Time Rule Eraser: subtracting the logarithmic value associated with unfair rules (i.e., the model's response to biased features) from the model's logit output removes the biased rules. Moreover, we present a specific implementation of Rule Eraser that involves two stages: (1) a limited number of queries are issued to the model with inaccessible weights to distill its biased rules into an additional patched model, and (2) at inference time, the biased rules distilled into the patched model are excluded from the output of the original model, following the removal strategy outlined by Rule Eraser. Extensive experimental evaluation demonstrates the effectiveness and superior performance of the proposed Rule Eraser in addressing fairness concerns.
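The removal step described above amounts to a subtraction in log space. Below is a minimal sketch of that step in NumPy, assuming the patched model outputs logits whose softmax approximates the biased rule's response (i.e., what the two-stage implementation distills it to estimate); the function and variable names (`erase_biased_rule`, `patched_logits`) are our own illustrative choices, not identifiers from the paper.

```python
import numpy as np

def log_softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def erase_biased_rule(original_logits: np.ndarray,
                      patched_logits: np.ndarray) -> np.ndarray:
    """Inference-time rule removal: subtract the log-probability that the
    patched model assigns to each class (its distilled estimate of the
    deployed model's response to biased features) from the deployed
    model's logits. Both inputs have shape (batch, num_classes)."""
    return original_logits - log_softmax(patched_logits)

# Illustrative usage on random logits (2 samples, 3 classes).
rng = np.random.default_rng(0)
debiased = erase_biased_rule(rng.normal(size=(2, 3)),
                             rng.normal(size=(2, 3)))
print(debiased.argmax(axis=-1))  # debiased predictions
```

Because the subtraction operates purely on model outputs, the deployed model's weights are never read or modified, which is the point of the paradigm.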