Test-time adaptation (TTA) has been shown to be effective at tackling distribution shifts between training and testing data by adapting a given model on test samples. However, the online model updating of TTA may be unstable, and this is often a key obstacle preventing existing TTA methods from being deployed in the real world. Specifically, TTA may fail to improve or even harm model performance when test data have: 1) mixed distribution shifts, 2) small batch sizes, and 3) online imbalanced label distribution shifts, all of which are quite common in practice. In this paper, we investigate the causes of this instability and find that the batch norm layer is a crucial factor hindering TTA stability. Conversely, TTA can perform more stably with batch-agnostic norm layers, i.e., group or layer norm. However, we observe that TTA with group and layer norms does not always succeed and still suffers from many failure cases. By digging into these failure cases, we find that certain noisy test samples with large gradients may disturb the model adaptation and result in collapsed trivial solutions, i.e., assigning the same class label to all samples. To address this collapse issue, we propose a sharpness-aware and reliable entropy minimization method, called SAR, which further stabilizes TTA from two aspects: 1) removing a subset of noisy samples with large gradients, and 2) encouraging model weights to converge to a flat minimum so that the model is robust to the remaining noisy samples. Promising results demonstrate that SAR performs more stably than prior methods and is computationally efficient under the above wild test scenarios.
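The two stabilizing steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy linear softmax classifier, the function name `sar_like_step`, and the parameters `grad_threshold`, `rho`, and `lr` are all hypothetical, and filtering directly on the per-sample gradient norm is an assumption made for illustration (the actual method may use a different reliability criterion). The sketch only shows the general shape of the idea: drop samples whose entropy gradient is large, then take a sharpness-aware step that evaluates the gradient at a nearby worst-case point before updating the weights.

```python
import math

def softmax(z):
    m = max(z)                      # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0.0)

def logits(W, x):
    # Toy linear classifier: one weight row per class (assumption).
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def entropy_grad(W, x):
    """Gradient of the prediction entropy w.r.t. W for one sample x."""
    p = softmax(logits(W, x))
    h = entropy(p)
    # dH/dz_j = -p_j * (log p_j + H) for softmax outputs p.
    dz = [-pj * (math.log(pj) + h) if pj > 0.0 else 0.0 for pj in p]
    return [[d * xi for xi in x] for d in dz]

def norm(G):
    return math.sqrt(sum(g * g for row in G for g in row))

def mean_grad(W, batch):
    G = [[0.0] * len(W[0]) for _ in W]
    for x in batch:
        g = entropy_grad(W, x)
        for r in range(len(W)):
            for c in range(len(W[0])):
                G[r][c] += g[r][c] / len(batch)
    return G

def mean_entropy(W, batch):
    return sum(entropy(softmax(logits(W, x))) for x in batch) / len(batch)

def sar_like_step(W, batch, grad_threshold=1.0, rho=0.05, lr=0.1):
    """One hypothetical adaptation step; returns (new_W, number_of_kept_samples)."""
    # 1) Reliability filter: drop samples with a large entropy gradient.
    kept = [x for x in batch if norm(entropy_grad(W, x)) <= grad_threshold]
    if not kept:
        return W, 0                 # nothing reliable, so skip the update
    # 2) Sharpness-aware update: move to a nearby point of higher loss,
    #    then descend using the gradient measured at that point.
    g = mean_grad(W, kept)
    scale = rho / (norm(g) + 1e-12)
    W_adv = [[w + scale * gi for w, gi in zip(wr, gr)] for wr, gr in zip(W, g)]
    g_adv = mean_grad(W_adv, kept)
    W_new = [[w - lr * gi for w, gi in zip(wr, gr)] for wr, gr in zip(W, g_adv)]
    return W_new, len(kept)

if __name__ == "__main__":
    W0 = [[0.5, 0.0], [0.0, 0.5]]
    batch = [[1.0, 0.0], [0.0, 1.0], [10.0, 9.0]]  # last sample has a large gradient
    W1, kept = sar_like_step(W0, batch)
    print("kept samples:", kept)
    print("mean entropy before/after:", mean_entropy(W0, batch[:2]), mean_entropy(W1, batch[:2]))
```

In a real deep network the same two steps would be applied only to the adapted parameters (for example, the affine parameters of the group or layer norm layers), and the gradients would come from automatic differentiation rather than the closed-form expression used in this toy model.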