Visual Question Answering (VQA) models play a critical role in enhancing the perception capabilities of autonomous driving systems: by allowing vehicles to analyze visual inputs alongside textual queries, they foster natural interaction and trust between the vehicle and its occupants or other road users. This study compares the attention patterns of humans and a VQA model when answering driving-related questions, revealing disparities in which objects each attends to. We propose an approach that integrates filters to optimize the model's attention mechanisms, prioritizing relevant objects and improving accuracy. Using the LXMERT model as a case study, we compare the attention patterns of the pre-trained and filter-integrated models against human answers on images from the NuImages dataset, gaining insights into feature prioritization. We evaluate the models with a subjective scoring framework, which shows that integrating the feature-encoder filter enhances the VQA model's performance by refining its attention mechanisms.