Multimodal image-text models have shown remarkable performance in the past few years. However, evaluating their robustness against distribution shifts is crucial before adopting them in real-world applications. In this work, we investigate the robustness of 12 popular open-source image-text models under common perturbations on five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). In particular, we propose several new multimodal robustness benchmarks by applying 17 image perturbation and 16 text perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, especially to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (\textbf{MMI}, the MultiModal Impact score, and \textbf{MOR}, the Missing Object Rate) for proper evaluation of multimodal models. We hope our extensive study sheds light on new directions for the development of robust multimodal models. More details can be found on the project webpage: \url{https://MMRobustness.github.io}.