Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations: unstable back-and-forth movements caused by overconfidence and miscalibration, leading to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni-inspired pruning procedure and then delegates final decisions to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements of up to 49% and 33% in visually grounded SPL and LLM-Match metrics, respectively, over baselines. Overall, our method achieves better scene coverage under equal exploration budgets on both the OpenEQA and EXPRESS-Bench datasets.
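The Holm-Bonferroni step-down procedure the abstract alludes to can be sketched as follows. This is a minimal illustration of the standard statistical test, not the paper's exact pruning rule: it assumes each frontier has been assigned a p-value-like implausibility score (how such scores are derived from VLM outputs is not specified here), and frontiers whose scores survive the step-down rejection are pruned before the coverage-based planner chooses among the rest. The function name `holm_prune` and the example scores are illustrative.

```python
def holm_prune(pvalues, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Sort the m p-values ascending and compare the k-th smallest
    (1-indexed) against alpha / (m - k + 1); stop at the first
    failure. Returns the set of indices whose null hypothesis is
    rejected -- here, the frontiers judged implausible enough to prune.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = set()
    for k, idx in enumerate(order):  # k is 0-indexed, so divisor is m - k
        if pvalues[idx] <= alpha / (m - k):
            rejected.add(idx)
        else:
            break  # step-down: once one test fails, all later ones fail too
    return rejected

# Hypothetical implausibility scores for four candidate frontiers.
scores = [0.01, 0.04, 0.03, 0.5]
pruned = holm_prune(scores, alpha=0.05)
surviving = [i for i in range(len(scores)) if i not in pruned]
```

In this example only frontier 0 passes the first (most stringent) threshold of 0.05/4, so it alone is pruned; the remaining frontiers are handed to the planner. The step-down structure is what makes the procedure conservative: it controls the family-wise error rate, so a single miscalibrated VLM score is unlikely to eliminate a frontier the agent should still consider.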