Can Visual Question Answering (VQA) systems perform just as well when deployed in the real world? Or are they susceptible to realistic corruption effects, e.g., image blur, which can be detrimental in sensitive applications such as medical VQA? While linguistic and textual robustness have been thoroughly explored in the VQA literature, there has yet to be significant work on the visual robustness of VQA models. We propose the first large-scale benchmark comprising 213,000 augmented images, challenging the visual robustness of multiple VQA models and assessing the strength of realistic visual corruptions. Additionally, we design several robustness evaluation metrics that can be aggregated into a unified metric and tailored to fit a variety of use cases. Our experiments reveal several insights into the relationships between model size, performance, and robustness to visual corruptions. Our benchmark highlights the need for a balanced approach in model development that considers model performance without compromising robustness.