Domain shifts in the training data are common in practical applications of machine learning; they occur, for instance, when the data come from different sources. Ideally, an ML model should work well independently of these shifts, for example by learning a domain-invariant representation. However, common ML losses do not give strong guarantees on how consistently the model performs across domains, in particular on whether it performs well on one domain at the expense of its performance on another. In this paper, we build new theoretical foundations for this problem by contributing a set of mathematical relations between classical losses for supervised ML and the Wasserstein distance in joint space (i.e., representation and output space). We show that classification or regression losses, when combined with a GAN-type discriminator between domains, form an upper bound on the true Wasserstein distance between domains. Minimizing this bound thus promotes a more invariant representation and more stable prediction performance across domains. The theoretical results are corroborated empirically on several image datasets. Our proposed approach systematically produces the highest minimum classification accuracy across domains and the most invariant representation.
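To make the two quantities named above concrete, the following toy sketch computes (a) the minimum classification accuracy across domains and (b) an empirical 1-Wasserstein distance between two domains. It is an illustration only, not the method of the paper: the function names are hypothetical, and the distance is restricted to one-dimensional, equal-size samples, where it reduces to the mean absolute difference of the sorted values. The paper's distance is defined in the joint representation and output space.

```python
def min_accuracy_across_domains(correct_by_domain):
    """Worst-domain accuracy.

    correct_by_domain maps a domain name to a list of 0/1 flags,
    one per example (1 = correctly classified). Hypothetical helper.
    """
    return min(sum(flags) / len(flags) for flags in correct_by_domain.values())


def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two 1D samples.

    Assumes equal sample sizes; in that case the optimal transport plan
    matches the i-th smallest value of xs to the i-th smallest of ys.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


if __name__ == "__main__":
    # Made-up per-example correctness flags for two domains.
    flags = {"source_a": [1, 1, 0, 1], "source_b": [1, 0, 0, 1]}
    print(min_accuracy_across_domains(flags))

    # Made-up 1D features; the second domain is the first shifted by 1.
    print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

A model that trades one domain for another can have a high average accuracy while the first quantity stays low, which is why the minimum rather than the mean is reported. The second quantity is zero when the two samples coincide and grows as one domain drifts away from the other.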