Adequate coverage of the sampling space is the keystone of effectively training trustworthy Machine Learning models. Unfortunately, real data carry several inherent risks due to the many potential biases they exhibit when gathered without proper random sampling over the reference population, and most of the time such sampling is far too expensive or time-consuming to be a viable option. Depending on how the training data have been gathered, unmitigated biases can lead to harmful or discriminatory consequences that ultimately hinder the large-scale applicability of pre-trained models and undermine their truthfulness or fairness expectations. In this paper, a mixed active sampling and data generation strategy -- called samplation -- is proposed as a means to compensate, during fine-tuning of a pre-trained classifier, for the unfair classifications it produces, assuming that the training data come from a non-probabilistic sampling schema. Given a pre-trained classifier, a fairness metric is first evaluated on a test set, then new reservoirs of labeled data are generated, and finally a number of reversely-biased artificial samples are produced for fine-tuning the model. Using Deep Models for visual semantic role labeling as a case study, the proposed method was able to fully cure a simulated gender bias starting from a 90/10 imbalance, using only a small percentage of new data and with a minor effect on accuracy.
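The abstract's pipeline (measure a fairness gap, build labeled reservoirs per group, then draw a reversely-biased fine-tuning batch) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `group_rates` and `reverse_biased_sample`, the use of the demographic-parity positive rate as the fairness signal, and the dictionary-of-lists reservoir layout are all assumptions made for the example.

```python
import random


def group_rates(predictions, groups):
    """Positive-prediction rate per protected group (the per-group
    components of a demographic-parity fairness check). Hypothetical
    helper, assumed for this sketch."""
    rates = {}
    for g in set(groups):
        vals = [p for p, gg in zip(predictions, groups) if gg == g]
        rates[g] = sum(vals) / len(vals)
    return rates


def reverse_biased_sample(reservoir, observed_ratio, n, rng=None):
    """Draw n items from labeled reservoirs {group: [items]} with the
    observed group proportions inverted, so that fine-tuning on the
    batch counteracts the original imbalance (e.g. 90/10 -> 10/90)."""
    rng = rng or random.Random(0)
    # Invert the observed proportions and renormalise to sum to 1.
    inverted = {g: 1.0 - observed_ratio[g] for g in reservoir}
    total = sum(inverted.values())
    batch = []
    for g, items in reservoir.items():
        k = round(n * inverted[g] / total)
        batch += rng.choices(items, k=k)  # sample with replacement
    return batch
```

For instance, with a reservoir of two groups and an observed 90/10 split, requesting a batch of 100 yields roughly 10 items from the over-represented group and 90 from the under-represented one, mirroring the bias-reversal idea described above.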