For a machine learning model to generalize to unseen data within a problem domain, the training data must be sufficiently large and representative of real-world scenarios. Real-world datasets, however, frequently contain overrepresented and underrepresented groups. One way to mitigate the resulting bias is to train on a diverse, representative dataset that covers all demographic groups. Collecting and labeling such datasets at scale is challenging, which has prompted the use of synthetic data generation and active labeling to reduce the cost of manual labeling. This study focuses on generating a robust face image dataset using the StyleGAN model. To achieve a balanced distribution across demographic groups, a synthetic dataset was created by controlling the StyleGAN generation process and annotated for different downstream tasks.