We consider the problem of online allocation subject to a long-term fairness penalty. Unlike existing work, we do not assume that the decision-maker observes the protected attributes, an assumption that is often unrealistic in practice. Instead, the decision-maker can purchase data that help estimate these attributes from sources of varying quality, and thereby reduce the fairness penalty at some cost. We model this as a multi-armed bandit problem, in which each arm corresponds to the choice of a data source, coupled with the online allocation problem. We propose an algorithm that jointly solves both problems and show that its regret is bounded by $\mathcal{O}(\sqrt{T})$. A key difficulty is that the rewards received from selecting a source are correlated through the fairness penalty, which makes randomization necessary even though the setting is stochastic. Our algorithm takes into account contextual information available before the source is selected, and can adapt to many different fairness notions. We also show that, in some instances, the estimates used can be learned on the fly.