While deep learning-based models like transformers have revolutionized time-series and vision tasks, they remain highly susceptible to noise and often overfit to noisy patterns rather than learning robust features. This issue is exacerbated in vision transformers, which rely on pixel-level details that are easily corrupted. To address this, we leverage the discrete wavelet transform (DWT), whose multi-resolution decomposition isolates noise primarily in the high-frequency sub-bands while preserving the essential low-frequency information needed for resilient feature learning. Conventional DWT-based methods, however, suffer from computational inefficiency because they require a subsequent inverse discrete wavelet transform (IDWT) step. In this work, we introduce RobustFormer, a novel framework that enables noise-robust masked autoencoder (MAE) pre-training for both images and videos by using the DWT for efficient downsampling, eliminating the need for expensive IDWT reconstruction and simplifying the attention mechanism to focus on noise-resilient multi-scale representations. To our knowledge, RobustFormer is the first DWT-based method fully compatible with video inputs and MAE-style pre-training. Extensive experiments on noisy image and video datasets demonstrate that our approach achieves up to an 8% increase in Top-1 classification accuracy under severe noise on ImageNet-C and up to 2.7% on the ImageNet-P benchmark compared to the baseline, and up to 13% higher Top-1 accuracy on UCF-101 under severe custom noise perturbations, while maintaining comparable accuracy on clean datasets. We also observe a reduction in computational complexity of up to 4.4% from removing the IDWT, compared to the VideoMAE baseline, without any performance drop.
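The abstract's central mechanism can be illustrated concretely: a single-level 2D wavelet decomposition splits a feature map into one low-frequency approximation sub-band and three high-frequency detail sub-bands, so keeping the approximation acts as a 2x downsampling with no IDWT reconstruction required. The following is a minimal sketch of that idea using a hand-rolled Haar transform; the `haar_dwt2d` helper, tensor shapes, and the choice of the Haar wavelet are illustrative assumptions, not the authors' implementation.

```python
import torch

def haar_dwt2d(x: torch.Tensor):
    """Single-level 2D Haar DWT (illustrative sketch, not the paper's code).

    x: tensor of shape (..., H, W) with even H and W.
    Returns the four sub-bands (ll, lh, hl, hh), each of shape (..., H/2, W/2).
    """
    # Gather the four pixels of each non-overlapping 2x2 block.
    x00 = x[..., 0::2, 0::2]
    x01 = x[..., 0::2, 1::2]
    x10 = x[..., 1::2, 0::2]
    x11 = x[..., 1::2, 1::2]
    # Orthonormal Haar combinations of each 2x2 block.
    ll = (x00 + x01 + x10 + x11) / 2  # low-frequency approximation
    lh = (x00 + x01 - x10 - x11) / 2  # high-frequency detail sub-band
    hl = (x00 - x01 + x10 - x11) / 2  # high-frequency detail sub-band
    hh = (x00 - x01 - x10 + x11) / 2  # high-frequency detail sub-band
    return ll, lh, hl, hh

# Using the LL sub-band directly as a 2x-downsampled representation:
# no IDWT is needed, and additive high-frequency noise is largely
# confined to the discarded lh/hl/hh detail sub-bands.
feats = torch.randn(8, 64, 56, 56)  # hypothetical (batch, channels, H, W)
ll, lh, hl, hh = haar_dwt2d(feats)
print(ll.shape)  # torch.Size([8, 64, 28, 28])
```

In this sketch the downsampling is "free" in the sense that the forward transform alone produces the coarser representation; a conventional DWT pipeline would additionally run an IDWT to return to the pixel domain, which is the cost the abstract says RobustFormer avoids.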