Distribution shifts are all too common in real-world applications of machine learning. Domain adaptation (DA) aims to address this by providing various frameworks for adapting models to the deployment data without using labels. However, the domain shift scenario raises a second, more subtle challenge: the difficulty of performing hyperparameter optimisation (HPO) for these adaptation algorithms without access to a labelled validation set. The unclear validation protocol for DA has led to bad practices in the literature, such as performing HPO using the target test labels even though they are not available in real-world scenarios. This has resulted in over-optimism about DA research progress compared to reality. In this paper, we analyse the state of DA under good evaluation practice by benchmarking a suite of candidate validation criteria and using them to assess popular adaptation algorithms. We show that there are challenges across all three branches of domain adaptation methodology: Unsupervised Domain Adaptation (UDA), Source-Free Domain Adaptation (SFDA), and Test Time Adaptation (TTA). While the results show that realistically achievable performance is often worse than expected, they also show that using proper validation splits is beneficial and that some previously unexplored validation metrics provide the best options to date. Altogether, our improved practices covering data, training, validation and hyperparameter optimisation form a new rigorous pipeline to improve benchmarking, and hence research progress, within this important field.