Test-Time Adaptation (TTA) has recently emerged as a promising strategy for improving the robustness of machine learning models under distribution shift by adapting the model during inference without access to any labels. Because the task is difficult, hyperparameters strongly influence the effectiveness of adaptation; however, the literature has offered little guidance on how to select them. In this work, we address this gap by evaluating existing TTA methods with surrogate-based hyperparameter-selection strategies that do not assume access to test labels, yielding a more realistic assessment of their performance. We show that under this evaluation setup, some recent state-of-the-art methods perform worse than earlier algorithms. We further show that forgetting remains a problem in TTA: the only method robust to hyperparameter selection resets the model to its initial state at every step. We analyze several types of unsupervised selection strategies, and while they work reasonably well in most scenarios, the only strategies that work consistently well rely on some form of supervision, either from a small number of annotated test samples or from the pretraining data. Our findings underscore the need for more rigorous benchmarking that explicitly states the model selection strategy; to facilitate this, we open-source our code.
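To make the idea of label-free surrogate selection concrete, the sketch below shows one common unsupervised criterion: choosing the hyperparameter setting whose post-adaptation predictions have the lowest mean entropy on the unlabeled test batch. This is a minimal illustration, not the evaluation protocol of the paper; the candidate names and toy probabilities are hypothetical.

```python
import numpy as np

def mean_prediction_entropy(probs: np.ndarray) -> float:
    """Average Shannon entropy (nats) of predicted class distributions.

    probs: array of shape (n_samples, n_classes), rows summing to 1.
    """
    eps = 1e-12  # guard against log(0)
    return float(-(probs * np.log(probs + eps)).sum(axis=1).mean())

def select_by_surrogate(candidate_probs: dict) -> tuple:
    """Pick the hyperparameter setting with the lowest mean entropy.

    candidate_probs maps a setting name to the softmax outputs the
    adapted model produces on the same unlabeled test batch.
    """
    scores = {name: mean_prediction_entropy(p)
              for name, p in candidate_probs.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Toy example with two hypothetical learning-rate settings:
confident = np.array([[0.90, 0.05, 0.05],
                      [0.85, 0.10, 0.05]])
uncertain = np.array([[0.40, 0.30, 0.30],
                      [0.34, 0.33, 0.33]])
best, scores = select_by_surrogate({"lr=1e-3": confident,
                                    "lr=1e-1": uncertain})
# The low-entropy candidate ("lr=1e-3") is selected.
```

Entropy is only a proxy for accuracy: a collapsed model that confidently predicts a single class also scores well, which is one reason such unsupervised criteria can fail where weakly supervised selection does not.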