Machine-learned systems are in widespread use for making decisions about humans, and it is important that they are fair, i.e., not biased against individuals based on sensitive attributes. We present runtime verification of algorithmic fairness for systems whose models are unknown, but are assumed to have a Markov chain structure. We introduce a specification language that can model many common algorithmic fairness properties, such as demographic parity, equal opportunity, and social burden. We build monitors that observe a long sequence of events as generated by a given system, and output, after each observation, a quantitative estimate of how fair or biased the system was on that run until that point in time. The estimate is proven to be correct modulo a variable error bound and a given confidence level, where the error bound gets tighter as the observed sequence gets longer. Our monitors are of two types, and use, respectively, frequentist and Bayesian statistical inference techniques. While the frequentist monitors compute estimates that are objectively correct with respect to the ground truth, the Bayesian monitors compute estimates that are correct subject to a given prior belief about the system's model. Using a prototype implementation, we show how we can monitor if a bank is fair in giving loans to applicants from different social backgrounds, and if a college is fair in admitting students while maintaining a reasonable financial burden on the society. Although they exhibit different theoretical complexities in certain cases, in our experiments, both frequentist and Bayesian monitors took less than a millisecond to update their verdicts after each observation.
翻译:机器学习系统被广泛应用于涉及人类决策的场景,确保其公平性至关重要——即不基于敏感属性对个体产生偏见。本文针对模型未知但假设具有马尔可夫链结构的系统,提出算法公平性的运行时验证方法。我们引入了一种规范语言,可建模多种常见算法公平性属性,如人口统计平等、机会均等和社会负担。我们构建监测器,通过观察系统生成的长期事件序列,在每次观察后输出该运行过程中截至当前时刻系统公平性或偏倚程度的量化估计。该估计被证明在可变误差界和给定置信水平下是正确的,且随着观察序列增长,误差界逐步收紧。我们的监测器分为两类,分别采用频率主义与贝叶斯统计推断技术:频率主义监测器计算基于真实情况客观正确的估计值,而贝叶斯监测器则依据对系统模型的先验信念生成估计值。通过原型实现,我们展示了如何监测银行向不同社会背景申请者发放贷款的公平性,以及大学在维持合理社会负担的前提下招生决策的公平性。尽管两类监测器在特定场景下展现出不同的理论复杂度,但在实验中,它们每次观察后更新判定结果的时间均不超过一毫秒。