This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, applied to questions about driving scenarios. Model performance is evaluated by measuring the similarity between each model's responses and reference answers provided by computer vision experts. The models were selected based on an analysis of how transformers are used in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion techniques show promise for generating better answers from a driving perspective. This initial analysis serves as a starting point for a forthcoming comprehensive comparative study of nine VQA models and sets the scene for further investigation into the effectiveness of VQA model queries in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.