This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, applied to questions about driving scenarios. Performance is evaluated by measuring how similar each model's responses are to reference answers provided by computer vision experts. The models were selected based on an analysis of how transformers are used in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion show promise for generating better answers in a driving context. This initial analysis is a starting point for a forthcoming comparative study of nine VQA models and sets the scene for further investigation into how effectively VQA models can be queried in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.
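The evaluation described above compares model responses with expert reference answers by similarity, but the abstract does not say which similarity measure is used. The following is a minimal sketch that assumes token-level Jaccard overlap as a stand-in metric; the function names (`tokenize`, `jaccard`, `score_answer`) and the example question data are hypothetical and are not taken from the paper or its supplementary material.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the token sets of two answers (assumed metric)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # two empty answers are treated as identical
    return len(ta & tb) / len(ta | tb)


def score_answer(model_answer: str, references: list[str]) -> float:
    """Score a model answer as its best match against any reference answer."""
    return max((jaccard(model_answer, ref) for ref in references), default=0.0)


# Hypothetical expert references for one driving-scene question.
references = ["a pedestrian is crossing the road", "pedestrian crossing"]
print(score_answer("pedestrian crossing the road", references))
```

Taking the maximum over references means an answer is rewarded for agreeing with any one expert, which suits questions with several acceptable phrasings. An embedding-based similarity could be swapped in for `jaccard` without changing `score_answer`.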