Many promising applications of supervised machine learning face hurdles in acquiring labeled data in sufficient quantity and quality, creating an expensive bottleneck. To overcome such limitations, techniques that do not depend on ground-truth labels have been studied, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, improving one another, how to build an interface between them is not well understood. In this work, we propose a model fusing programmatic weak supervision and generative adversarial networks and provide theoretical justification motivating this fusion. The proposed approach captures discrete latent variables in the data alongside the weak-supervision-derived label estimate. Alignment of the two allows for better modeling of sample-dependent accuracies of the weak supervision sources, improving the estimate of unobserved labels. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned latent variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass image classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.