Many promising applications of supervised machine learning face hurdles in acquiring labeled data of sufficient quantity and quality, creating an expensive bottleneck. To overcome such limitations, techniques that do not depend on ground-truth labels have been studied, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, improving one another, how to build an interface between them is not well understood. In this work, we propose a model fusing programmatic weak supervision and generative adversarial networks and provide theoretical justification motivating this fusion. The proposed approach captures discrete latent variables in the data alongside the weak-supervision-derived label estimate. Aligning the two allows for better modeling of the sample-dependent accuracies of the weak supervision sources, improving the estimate of unobserved labels. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned latent variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass image classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.