Although extensive datasets have driven considerable advances in machine learning, a significant disparity persists in the availability of data across sources and populations. This inequality across domains makes modeling difficult for those with limited data, which can lead to profound practical and ethical concerns. In this paper, we address a representative case of the data inequality problem across domains, termed Semi-Supervised Domain Generalization (SSDG), in which only one domain is labeled while the rest are unlabeled. We propose a novel algorithm, ProUD, which learns domain-invariant features via domain-aware prototypes and generalizes progressively via uncertainty-adaptive mixing of labeled and unlabeled domains. Our experiments on three benchmark datasets demonstrate the effectiveness of ProUD, which outperforms all baseline models, including single domain generalization and semi-supervised learning methods. Source code will be released upon acceptance of the paper.