Autonomous driving environments are dynamic and impose stringent safety requirements, and general multimodal large language models (MLLMs) paired only with CLIP often struggle to represent driving-specific scenarios accurately, particularly complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: an Affinity hint, which emphasizes instance-level structure by strengthening token-wise connections; a Semantic hint, which incorporates high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs; and a Question hint, which aligns visual features with the query context and focuses on question-relevant regions. These hints are fused through a Hint Fusion module, which enriches visual representations and enhances multimodal reasoning for autonomous driving visual question answering (VQA) tasks. Extensive experiments confirm the effectiveness of the HoP framework, showing that it significantly outperforms previous state-of-the-art methods across all key metrics.
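The three hints and their fusion can be pictured with a toy sketch. This is an illustration under stated assumptions, not the HoP implementation: the abstract does not specify how each hint is computed or fused, so the sketch assumes self-attention over visual tokens for the Affinity hint, relevance weighting against a pooled question embedding for the Question hint, a precomputed feature per token for the Semantic hint, and concatenation followed by a linear projection for the Hint Fusion module. All function names (`affinity_hint`, `question_hint`, `fuse_hints`) and the random inputs are hypothetical, and plain Python is used only to keep the example dependency-free.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def affinity_hint(tokens):
    """ASSUMPTION: token-wise connections modeled as scaled self-attention.
    Each token becomes a similarity-weighted average of all tokens."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for t in tokens:
        weights = softmax([dot(t, u) / scale for u in tokens])
        out.append([sum(w * u[k] for w, u in zip(weights, tokens))
                    for k in range(len(t))])
    return out


def question_hint(tokens, question):
    """ASSUMPTION: each visual token is rescaled by its softmax relevance
    to a single pooled question embedding."""
    scale = math.sqrt(len(question))
    weights = softmax([dot(t, question) / scale for t in tokens])
    return [[w * x for x in t] for w, t in zip(weights, tokens)]


def fuse_hints(tokens, affinity, semantic, question, w_fuse):
    """ASSUMPTION: fusion is concatenation plus a linear projection.
    w_fuse holds d output columns, each of length 4 * d."""
    fused = []
    for parts in zip(tokens, affinity, semantic, question):
        concat = [x for part in parts for x in part]  # length 4 * d
        fused.append([dot(concat, col) for col in w_fuse])
    return fused


random.seed(0)
n_tokens, dim = 4, 3


def rand_vec(n, scale=1.0):
    return [random.gauss(0, 1) * scale for _ in range(n)]


visual = [rand_vec(dim) for _ in range(n_tokens)]    # stand-in for CLIP tokens
semantic = [rand_vec(dim) for _ in range(n_tokens)]  # stand-in for driving-specific features
question = rand_vec(dim)                             # stand-in for a pooled question embedding
w_fuse = [rand_vec(4 * dim, 1 / math.sqrt(4 * dim)) for _ in range(dim)]

fused = fuse_hints(visual, affinity_hint(visual), semantic,
                   question_hint(visual, question), w_fuse)
print(len(fused), len(fused[0]))  # prints "4 3": one fused vector per visual token
```

The fused output keeps the same token count and width as the original visual tokens, so under these assumptions it could replace them as input to the language model without changing the downstream interface.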