Audio-only walking navigation can leave users disoriented: instructions rely on vague cardinal directions and lack real-time environmental context, leading to frequent errors. To address this, we present a system that integrates a Vision Language Model (VLM) with spatial audio cues. The system extracts environmental landmarks to anchor navigation instructions and, crucially, plays a directional spatial audio signal whenever the user faces the wrong direction, indicating the precise turn direction. In a user study (n=12), the VLM system with the spatial audio cue reduced route deviations compared to both a VLM-only system and an audio-only Google Maps baseline. Users reported that the spatial audio cue effectively supported orientation and that landmark-anchored instructions provided a better navigation experience than audio-only Google Maps. This work offers an initial look at how future audio-only navigation systems can incorporate directional cues, especially real-time corrective spatial audio.
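The corrective cue described above can be illustrated with a minimal sketch: compute the signed heading error between the route's target bearing and the user's current heading, and map it to a left/right stereo pan so the cue appears to come from the direction of the required turn. All function names, the facing threshold, and the constant-power panning law are illustrative assumptions, not details from the paper.

```python
import math

def heading_error_deg(target_bearing, user_heading):
    # Signed smallest-angle difference in degrees, wrapped to [-180, 180).
    # Positive means the target lies to the user's right.
    return (target_bearing - user_heading + 180.0) % 360.0 - 180.0

def stereo_gains(error_deg, threshold_deg=30.0):
    # No cue while the user is roughly facing the right way
    # (threshold_deg is an assumed parameter, not from the paper).
    if abs(error_deg) <= threshold_deg:
        return (0.0, 0.0)
    # Pan in [-1, 1]: negative = cue from the left (turn left),
    # positive = cue from the right (turn right).
    pan = max(-1.0, min(1.0, error_deg / 180.0))
    # Constant-power panning keeps perceived loudness stable across pans.
    theta = (pan + 1.0) * math.pi / 4.0  # 0 .. pi/2
    return (math.cos(theta), math.sin(theta))  # (left gain, right gain)
```

For example, a user heading 350° with a target bearing of 10° has a +20° error; within the assumed 30° threshold, no cue plays, while a +90° error yields a cue panned to the right.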