A Modular System for Enhanced Robustness of Multimedia Understanding Networks via Deep Parametric Estimation

In multimedia understanding tasks, corrupted samples pose a critical challenge because they degrade the performance of machine learning models that receive them as input. Three groups of approaches have been proposed to handle noisy data: i) enhancer and denoiser modules that improve the quality of the noisy data, ii) data augmentation approaches, and iii) domain adaptation strategies. All of these approaches have drawbacks that limit their applicability: the first has high computational cost and requires pairs of clean-corrupted data for training, while the other two can only be deployed on the same task/network they were trained on (i.e., when the upstream and downstream task/network are the same). In this paper, we propose SyMPIE to address these shortcomings. To this end, we design a small, modular, and efficient system (just 2 GFLOPs to process a Full HD image) that enhances input data for robust downstream multimedia understanding at minimal computational cost. SyMPIE is pre-trained on an upstream task/network that need not match the downstream one, and it does not require paired clean-corrupted samples. Our key insight is that most input corruptions found in real-world tasks can be modeled through global operations on the color channels of images or through spatial filters with small kernels. We validate our approach on multiple datasets and tasks, namely image classification (on ImageNetC, ImageNetC-Bar, VizWiz, and a newly proposed mixed-corruption benchmark named ImageNetC-mixed) and semantic segmentation (on Cityscapes, ACDC, and DarkZurich), with consistent relative accuracy gains of about 5%. The code of our approach and the new ImageNetC-mixed benchmark will be made available upon publication.
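To make the key insight concrete, the following is a minimal NumPy sketch of the two families of operations the abstract names: a global operation on color channels (here assumed to be a 3x3 mixing matrix plus an offset) and a spatial filter with a small kernel. It is an illustration under stated assumptions, not the SyMPIE implementation. The function names, the affine form of the color operation, the edge padding, and the hand-picked parameters are all hypothetical; in the proposed system such parameters would be estimated by a network rather than fixed by hand.

```python
import numpy as np


def apply_color_matrix(img, M, b):
    """Global color operation (assumed affine): every pixel's RGB vector
    is mapped to M @ rgb + b. img is HxWx3, M is 3x3, b has length 3."""
    return img @ M.T + b


def apply_small_kernel(img, k):
    """Spatial filter with a small odd-sized kernel, applied to each
    channel independently as a 'same'-size correlation with edge padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    # Accumulate one shifted copy of the image per kernel tap.
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * padded[i:i + h, j:j + w, :]
    return out


def parametric_transform(img, M, b, k):
    """Hypothetical composition: color operation, then spatial filter,
    then clipping to the valid [0, 1] range."""
    img = np.asarray(img, dtype=np.float64)
    out = apply_small_kernel(apply_color_matrix(img, M, b), k)
    return np.clip(out, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8, 3))           # stand-in for a real image in [0, 1]
    gain = np.diag([1.2, 1.0, 0.9])         # illustrative per-channel gains
    offset = np.array([0.0, 0.02, 0.0])     # illustrative color offset
    sharpen = np.array([[0.0, -0.25, 0.0],
                        [-0.25, 2.0, -0.25],
                        [0.0, -0.25, 0.0]])  # illustrative 3x3 kernel, sums to 1
    enhanced = parametric_transform(image, gain, offset, sharpen)
    print(enhanced.shape)
```

With an identity matrix, zero offset, and a delta kernel the transform leaves the image unchanged, which shows how few parameters such a model needs: 12 for the color operation and 9 for a 3x3 kernel.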
