While deep learning models are powerful tools that have revolutionized many areas, they are also vulnerable to noise because they rely heavily on patterns and features learned from the exact details of clean data. Transformers, which have become the backbone of modern vision models, are no exception. Current Discrete Wavelet Transform (DWT)-based methods do not benefit from masked autoencoder (MAE) pre-training, since the inverse DWT (iDWT) these approaches introduce is computationally inefficient and incompatible with video inputs in transformer architectures. In this work, we present RobustFormer, a method that overcomes these limitations by enabling noise-robust pre-training for both images and videos, improving the efficiency of DWT-based methods by removing the need for computationally expensive iDWT steps, and simplifying the attention mechanism. To our knowledge, the proposed method is the first DWT-based method compatible with video inputs and masked pre-training. Our experiments show that MAE-based pre-training allows us to bypass the iDWT step, greatly reducing computation. In extensive tests on benchmark datasets, RobustFormer achieves state-of-the-art results on both image and video tasks.
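To make the central idea concrete, the following is a minimal sketch (not the authors' implementation) of why MAE-style pre-training can bypass iDWT: if the reconstruction targets are the wavelet sub-bands themselves, the pipeline never has to invert the transform. The one-level Haar DWT, the patch size, the mask ratio, and the helper names (`haar_dwt2`, `random_mask`) are all illustrative assumptions.

```python
# Illustrative sketch only: one-level 2D Haar DWT in PyTorch, followed by
# MAE-style random masking of sub-band patches. A decoder trained to regress
# the masked sub-band tokens directly would make an iDWT step unnecessary.
import torch

def haar_dwt2(x: torch.Tensor):
    """One-level 2D Haar DWT. x: (B, C, H, W) with even H, W.
    Returns LL, LH, HL, HH, each of shape (B, C, H/2, W/2)."""
    a = x[:, :, 0::2, 0::2]   # top-left pixel of each 2x2 block
    b = x[:, :, 0::2, 1::2]   # top-right
    c = x[:, :, 1::2, 0::2]   # bottom-left
    d = x[:, :, 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2  # low-pass approximation (noise-robust content)
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style masking: keep a random subset of tokens.
    tokens: (B, N, D). Returns visible tokens and their kept indices."""
    B, N, _ = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]
    visible = torch.gather(
        tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    return visible, idx

x = torch.randn(2, 3, 224, 224)              # a batch of images
ll, lh, hl, hh = haar_dwt2(x)                # four sub-bands, 112x112 each
subbands = torch.cat([ll, lh, hl, hh], 1)    # (2, 12, 112, 112)
p = 16                                       # patch size (assumed)
tokens = subbands.unfold(2, p, p).unfold(3, p, p)         # patchify
tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(2, -1, 12 * p * p)
visible, idx = random_mask(tokens)           # encoder sees visible tokens only
# Reconstruction loss is computed against the masked sub-band tokens,
# so the forward pipeline contains a DWT but no iDWT.
print(visible.shape)                         # torch.Size([2, 12, 3072])
```

Under these assumptions, the inverse transform is only ever needed if one insists on reconstructing pixels; supervising in the wavelet domain removes that requirement, which is the efficiency argument the abstract makes.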