We introduce a fairness-aware dataset for job recommendations in advertising, designed to foster research on algorithmic fairness in real-world scenarios. The dataset was collected and prepared to comply with privacy standards and business confidentiality. An additional challenge is the lack of access to protected user attributes such as gender, for which we propose a method to obtain a proxy estimate. Despite being anonymized and relying on a proxy for the sensitive attribute, the dataset preserves predictive power and remains a realistic, challenging benchmark. It addresses a significant gap in the availability of fairness-focused resources for high-impact domains such as advertising -- where the real-world stakes are whether users gain access to valuable employment opportunities, and where balancing fairness and utility is a common industrial challenge. We also examine the stages of the advertising process at which unfairness can arise, and introduce a method to compute a fair utility metric for job recommendations in online systems from a biased dataset. Experimental evaluations of bias mitigation techniques on the released dataset demonstrate potential improvements in fairness and the associated trade-offs with utility. The dataset is hosted at https://huggingface.co/datasets/criteo/FairJob. Source code for the experiments is hosted at https://github.com/criteo-research/FairJob-dataset/.
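To make the fairness-vs-utility framing concrete, the sketch below computes a standard demographic parity gap (the absolute difference in positive-prediction rates across groups) on synthetic data. This is a generic illustrative metric, not the paper's proposed fair utility metric; the score distribution, group labels, and threshold are all hypothetical.

```python
import numpy as np

def demographic_parity_gap(scores, group, threshold=0.5):
    """Absolute difference in positive-prediction rates between two groups.

    scores: model scores in [0, 1]; group: binary proxy attribute (0/1).
    """
    pred = scores >= threshold
    rate_a = pred[group == 0].mean()
    rate_b = pred[group == 1].mean()
    return abs(rate_a - rate_b)

# Synthetic example: scores mildly skewed toward group 1 to mimic bias.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=1000)
scores = np.clip(rng.uniform(size=1000) + 0.1 * group, 0.0, 1.0)
print(f"demographic parity gap: {demographic_parity_gap(scores, group):.3f}")
```

A bias mitigation technique that improves this gap will typically lower raw predictive utility (e.g. click-prediction accuracy), which is the trade-off the benchmark is designed to expose.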