Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization

Detecting unanswerable user queries is essential for the reliable deployment of real-world embodied agents. However, modern vision-language models (VLMs) often generate overconfident answers even when the available visual memory cannot support the query. The resulting risks depend on the task: in Embodied Question Answering, the agent may give the user misleading information, and in spatial reasoning for navigation, it may select an arbitrary coordinate and physically guide the user there. Despite these high stakes, few prior studies directly address when and how an embodied VLM should respond with "I do not know." This work proposes Semantic Flip, a simple yet effective framework that synthesizes auxiliary out-of-distribution (OOD) samples for embodied refusal without requiring external OOD annotations. The key idea is to independently transform the query and the video memory to construct auxiliary OOD pairs that lack sufficient visual grounding. These synthesized pairs are used to train a lightweight rejection module on top of a frozen pretrained VLM, so the module attaches to any existing VLM-based pipeline without retraining the underlying model. Across two complementary benchmarks, Semantic Flip consistently outperforms strong prompting baselines. This work also introduces SpaceReject, a new refusal benchmark for spatial localization with deliberately unanswerable queries over long video memory, on which Semantic Flip achieves an $F_1$ score of 0.9559. The source code and datasets are publicly available at https://github.com/ndb796/SemanticFlip.
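
To make the idea of auxiliary OOD pairs concrete, the following is a minimal sketch rather than the method itself. The abstract does not specify the transformations, so the sketch substitutes a simple stand-in: each query is re-paired with the video memory of a different sample, which tends to yield pairs without visual grounding. The names `Pair` and `make_auxiliary_ood` are hypothetical, and each sample is assumed to have a distinct memory identifier.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    query: str
    memory_id: str  # hypothetical handle to a video memory
    label: int      # 0 = answerable, 1 = should be refused (auxiliary OOD)


def make_auxiliary_ood(pairs, seed=0):
    """Re-pair every query with another sample's memory (illustrative stand-in).

    Assumes each input pair has a distinct memory_id; this is NOT the
    transformation used by Semantic Flip, which the abstract leaves unspecified.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs to build mismatches")
    rng = random.Random(seed)
    # A non-zero rotation guarantees that no query keeps its own memory.
    offset = rng.randrange(1, n)
    return [
        Pair(p.query, pairs[(i + offset) % n].memory_id, 1)
        for i, p in enumerate(pairs)
    ]


answerable = [
    Pair("Where is the red mug?", "kitchen_tour", 0),
    Pair("What color is the sofa?", "living_room_tour", 0),
    Pair("Where did I leave the keys?", "hallway_tour", 0),
]
# Training set for a rejection module: original pairs plus synthetic refusals.
train_set = answerable + make_auxiliary_ood(answerable)
```

Under these assumptions, `train_set` holds both answerable and refusal-labeled pairs, which is the kind of supervision a lightweight rejection module on top of frozen VLM features would need.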
