Multimodal image-text models have shown remarkable performance in the past few years. However, evaluating their robustness against distribution shifts is crucial before adopting them in real-world applications. In this work, we investigate the robustness of 12 popular open-source image-text models under common perturbations on five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). In particular, we propose several new multimodal robustness benchmarks by applying 17 image-perturbation and 16 text-perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, and are especially sensitive to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (\textbf{MMI} for MultiModal Impact score and \textbf{MOR} for Missing Object Rate) for a proper evaluation of multimodal models. We hope our extensive study sheds light on new directions for the development of robust multimodal models. More details can be found on the project webpage: \url{https://MMRobustness.github.io}.