WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation

Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are available on the project website (https://sites.google.com/view/walkgpt-26/home).
