Deep neural networks have been central to Visual Question Answering (VQA), and research has traditionally focused on improving model accuracy. Recently, however, attention has shifted towards evaluating the robustness of these models against adversarial attacks. This involves measuring the accuracy of VQA models under increasing levels of noise in the input, where the noise can target either the image or the query question, referred to as the main question. This aspect of VQA still lacks a proper analysis. This work proposes a new method that uses semantically related questions, referred to as basic questions, as noise to evaluate the robustness of VQA models. The hypothesis is that the noise level increases as the similarity of a basic question to the main question decreases. To generate a reasonable noise level for a given main question, a pool of basic questions is ranked by similarity to the main question, and this ranking problem is cast as a LASSO optimization problem. This work also proposes a novel robustness measure, R_score, and two basic question datasets to standardize the analysis of VQA model robustness. The experimental results demonstrate that the proposed evaluation method effectively analyzes the robustness of VQA models. They also show that in-context learning with a chain of basic questions can improve model accuracy.
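To make the ranking step concrete, the following is a minimal sketch of how a pool of basic questions could be ranked with LASSO. It rests on assumptions not stated in the abstract: each question is already encoded as a fixed-length vector, the objective is 0.5 * ||A x - b||^2 + lam * ||x||_1 with the basic-question vectors as the columns of A and the main-question vector as b, and a larger weight is read as higher similarity. The function names, the coordinate-descent solver, and the value of `lam` are illustrative choices, not the authors' implementation.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (proximal operator of the L1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(columns, target, lam, n_iters=200):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 by cyclic coordinate descent.

    columns: list of column vectors of A (one per basic question; assumed encoding).
    target:  the vector b (the main question; assumed encoding).
    """
    weights = [0.0] * len(columns)
    residual = list(target)  # equals b - A x, and x starts at zero
    col_sq_norms = [sum(v * v for v in col) for col in columns]

    for _ in range(n_iters):
        for j, col in enumerate(columns):
            if col_sq_norms[j] == 0.0:
                continue  # an all-zero column cannot explain anything
            # Correlation of column j with the residual that excludes its own contribution.
            rho = sum(c * r for c, r in zip(col, residual)) + col_sq_norms[j] * weights[j]
            new_weight = soft_threshold(rho, lam) / col_sq_norms[j]
            delta = new_weight - weights[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                weights[j] = new_weight
    return weights


def rank_basic_questions(main_vec, basic_vecs, lam=0.1):
    """Return (index, weight) pairs sorted from most to least similar.

    Assumption: a larger LASSO weight means the basic question is more similar
    to the main question, so it contributes a lower level of noise.
    """
    weights = lasso_coordinate_descent(basic_vecs, main_vec, lam)
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return [(j, weights[j]) for j in order]
```

For example, `rank_basic_questions([1.0, 0.3, 0.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])` places the first toy basic question at the top and gives the third a weight of exactly zero. The L1 penalty is what drives the weights of unrelated questions to zero, which is why a sparse ranking falls out of the optimization.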