A Modular System for Enhanced Robustness of Multimedia Understanding Networks via Deep Parametric Estimation

In multimedia understanding tasks, corrupted samples pose a critical challenge because they degrade the performance of the machine learning models that process them. Three groups of approaches have been proposed to handle noisy data: i) enhancer and denoiser modules that improve the quality of the noisy data, ii) data augmentation approaches, and iii) domain adaptation strategies. All of these approaches come with drawbacks that limit their applicability: the first has high computational costs and requires pairs of clean and corrupted data for training, while the others can only be deployed on the same task/network they were trained on (i.e., when the upstream and downstream tasks/networks are the same). In this paper, we propose SyMPIE to address these shortcomings. To this end, we design a small, modular, and efficient system (just 2 GFLOPs to process a Full HD image) that enhances input data for robust downstream multimedia understanding at minimal computational cost. SyMPIE is pre-trained on an upstream task/network that need not match the downstream ones, and it does not need paired clean-corrupted samples. Our key insight is that most input corruptions found in real-world tasks can be modeled through global operations on the color channels of images or through spatial filters with small kernels. We validate our approach on multiple datasets and tasks, such as image classification (on ImageNetC, ImageNetC-Bar, VizWiz, and a newly proposed mixed corruption benchmark named ImageNetC-mixed) and semantic segmentation (on Cityscapes, ACDC, and DarkZurich), with consistent relative accuracy gains of about 5% across the board. The code of our approach and the new ImageNetC-mixed benchmark will be made available upon publication.
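To make the key insight concrete, the following is a minimal sketch of the two operation families named above: a global color-channel operation (here assumed to be a 3x3 matrix plus a bias applied identically to every pixel) and a spatial filter with a small kernel. It is an illustration under stated assumptions, not the SyMPIE implementation: the function names are hypothetical, the affine form of the color operation is an assumption, and the parameters are chosen by hand here rather than estimated from the input.

```python
import numpy as np


def apply_color_transform(image, matrix, bias):
    """Global color operation (assumed affine): each pixel p becomes matrix @ p + bias.

    image: (H, W, 3) float array, matrix: (3, 3), bias: (3,).
    """
    return image @ matrix.T + bias


def apply_small_kernel(image, kernel):
    """Spatial filter: correlate every channel with a small k x k kernel (k odd)."""
    k = kernel.shape[0]
    pad = k // 2
    height, width = image.shape[:2]
    # Edge padding keeps the output the same size as the input.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    # Accumulate one shifted, weighted copy of the image per kernel tap.
    for dy in range(k):
        for dx in range(k):
            out += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width, :]
    return out


def enhance(image, matrix, bias, kernel):
    """Hypothetical enhancement step: color transform, then small filter, then clip."""
    corrected = apply_color_transform(image, matrix, bias)
    filtered = apply_small_kernel(corrected, kernel)
    return np.clip(filtered, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.random((8, 8, 3))

    # Simulated corruption: reduced contrast plus a brightness offset.
    corrupted = 0.5 * clean + 0.1

    # Hand-picked parameters that invert this particular corruption.
    matrix = 2.0 * np.eye(3)
    bias = np.full(3, -0.2)
    kernel = np.zeros((3, 3))
    kernel[1, 1] = 1.0  # identity filter: no spatial change needed here

    restored = enhance(corrupted, matrix, bias, kernel)
    print("restored matches clean:", np.allclose(restored, clean))
```

In this toy case the corruption is exactly invertible by an affine color transform, so the spatial kernel is left as the identity; a blur or sharpening corruption would instead call for a non-trivial small kernel.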
