Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors

Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into three ordered stages (Perception, Prediction, and Planning), and planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training and reduces NLI contradiction by up to 42.6%. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection of it into the next stage's input embeddings. These projectors are trained jointly with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct), updating only about 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, as measured by a multilingual NLI classifier chosen to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade, which we attribute to the absence of driving-domain pretraining. Because the two variants use different base models, we present them as complementary case studies rather than a head-to-head comparison: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains. Together, these results suggest domain adaptation as a plausible next ingredient for full-spectrum improvement.
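To make the gated projection concrete, the following is a minimal NumPy sketch of one plausible form of such a projector, not the trained implementation. The abstract specifies only that a hidden-state vector is extracted, projected, normalized, gated, and injected into the next stage's input embeddings. Everything else here is an assumption for illustration: the class name `GatedContextProjector`, mean pooling over the previous stage's hidden states, layer normalization, a scalar tanh gate initialized at zero, and additive injection into every token embedding.

```python
import numpy as np


class GatedContextProjector:
    """Hypothetical sketch: carries one stage's hidden state into the next stage."""

    def __init__(self, hidden_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Linear projection parameters (assumed; these would be learned in training).
        self.weight = rng.normal(0.0, 0.02, size=(hidden_dim, hidden_dim))
        self.bias = np.zeros(hidden_dim)
        # Scalar gate parameter (assumed). tanh(0) = 0, so the projector
        # starts as a no-op and leaves the base model's inputs untouched.
        self.gate = 0.0

    @staticmethod
    def _layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
        # Normalize a single vector to zero mean and unit variance.
        return (x - x.mean()) / np.sqrt(x.var() + eps)

    def __call__(self, prev_hidden: np.ndarray, next_embeds: np.ndarray) -> np.ndarray:
        # prev_hidden: (prev_len, hidden_dim) hidden states from the earlier stage.
        # next_embeds: (next_len, hidden_dim) input embeddings of the later stage.
        context = prev_hidden.mean(axis=0)  # pooling choice is an assumption
        projected = self._layer_norm(context @ self.weight + self.bias)
        # Broadcast the gated context vector onto every token embedding.
        return next_embeds + np.tanh(self.gate) * projected


rng = np.random.default_rng(1)
perception_hidden = rng.normal(size=(12, 8))  # toy stand-in for stage-1 states
planning_embeds = rng.normal(size=(5, 8))     # toy stand-in for stage-2 inputs

projector = GatedContextProjector(hidden_dim=8)
unchanged = projector(perception_hidden, planning_embeds)  # gate closed
projector.gate = 0.5
shifted = projector(perception_hidden, planning_embeds)    # gate partly open
```

With the gate at zero the later stage sees its original embeddings, so in this sketch training would begin from the unmodified base model and open the gate only as far as the context proves useful.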
