We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target from speech input alone. Motivated by our formative study of speech referencing patterns, we characterize four recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them in our object-centric relational graph. Given an utterance, the system parses its referent cues and renders them as persistent in-situ AR visual guidance, reducing the iterative micro-guidance ("a bit more to the right", "now, stop") that remote assistance otherwise requires. We demonstrate use cases of our system in remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.
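To make the grounding step concrete, the following is a minimal sketch of how the four reference types could be resolved against an object-centric relational graph. It is an illustration under stated assumptions, not the authors' implementation: the `SceneGraph` class, its method names, the predicate strings such as `left_of`, and the example scene are all hypothetical, and the sketch assumes the utterance has already been parsed into labels, attributes, and relations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SceneObject:
    obj_id: str
    label: str
    attributes: frozenset = frozenset()


@dataclass
class SceneGraph:
    """Hypothetical object-centric relational graph (names are assumptions)."""
    objects: dict = field(default_factory=dict)    # obj_id -> SceneObject
    relations: set = field(default_factory=set)    # (subject_id, predicate, object_id)
    history: list = field(default_factory=list)    # previously resolved obj_ids

    def add(self, obj_id, label, *attributes):
        self.objects[obj_id] = SceneObject(obj_id, label, frozenset(attributes))

    def relate(self, subject_id, predicate, object_id):
        self.relations.add((subject_id, predicate, object_id))

    def direct(self, label, *attributes):
        """Direct Attribute: match objects by label and required attributes."""
        wanted = set(attributes)
        return sorted(o.obj_id for o in self.objects.values()
                      if o.label == label and wanted <= o.attributes)

    def relational(self, candidates, predicate, anchors):
        """Relational: keep candidates that stand in `predicate` to any anchor."""
        return [c for c in candidates
                if any((c, predicate, a) in self.relations for a in anchors)]

    def remembered(self, label):
        """Remembrance: most recently resolved object carrying this label."""
        for obj_id in reversed(self.history):
            if self.objects[obj_id].label == label:
                return [obj_id]
        return []


# Example scene (entirely made up for illustration).
graph = SceneGraph()
graph.add("mug1", "mug", "red")
graph.add("mug2", "mug", "red")
graph.add("laptop1", "laptop")
graph.add("notebook1", "notebook")
graph.add("pen1", "pen")
graph.relate("mug1", "left_of", "laptop1")
graph.relate("mug2", "right_of", "laptop1")
graph.relate("notebook1", "left_of", "laptop1")
graph.relate("pen1", "on", "notebook1")

# "the red mug" is ambiguous on attributes alone.
ambiguous = graph.direct("mug", "red")

# Relational: "the red mug left of the laptop".
target = graph.relational(ambiguous, "left_of", graph.direct("laptop"))
graph.history.extend(target)

# Chained: "the pen on the notebook left of the laptop" (two relational hops).
notebooks = graph.relational(graph.direct("notebook"), "left_of",
                             graph.direct("laptop"))
chained = graph.relational(graph.direct("pen"), "on", notebooks)

# Remembrance: "the mug from before".
recalled = graph.remembered("mug")
```

In this sketch a Chained reference is simply a composition of Relational hops, and Remembrance is a lookup over previously resolved targets; a real system would also need to handle the case where a resolver returns zero or several candidates.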