We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references. Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them in our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop") during a remote session. We demonstrate our system through remote guided-assistance and intent-disambiguation use cases. Our evaluation shows that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.
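To make the object-centric relational graph and reference patterns concrete, the following is a minimal, illustrative Python sketch, not the Speech-to-Spatial implementation. All class, field, and relation names (e.g., SceneObject, RelationalGraph, next_to) are hypothetical, and only the Direct Attribute and Relational patterns named above are shown.

```python
# Illustrative sketch only: a tiny object-centric relational graph and a
# resolver for two reference patterns (Direct Attribute, Relational).
# Names and structure are assumptions, not the authors' implementation.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    obj_id: str
    label: str                                      # e.g., "screwdriver"
    attributes: set = field(default_factory=set)    # e.g., {"red", "small"}

@dataclass
class RelationalGraph:
    objects: dict = field(default_factory=dict)     # obj_id -> SceneObject
    relations: list = field(default_factory=list)   # (subject_id, relation, object_id)

    def add_object(self, obj):
        self.objects[obj.obj_id] = obj

    def add_relation(self, subj_id, relation, obj_id):
        self.relations.append((subj_id, relation, obj_id))

    def resolve_direct_attribute(self, label, attribute):
        """Direct Attribute: 'the red screwdriver' -> matching objects."""
        return [o for o in self.objects.values()
                if o.label == label and attribute in o.attributes]

    def resolve_relational(self, label, relation, anchor_label):
        """Relational: 'the cup next to the laptop' -> objects related to an anchor."""
        anchors = {o.obj_id for o in self.objects.values() if o.label == anchor_label}
        return [self.objects[s] for (s, r, t) in self.relations
                if r == relation and t in anchors and self.objects[s].label == label]

# Usage: build a small scene and resolve two utterance types.
g = RelationalGraph()
g.add_object(SceneObject("o1", "screwdriver", {"red"}))
g.add_object(SceneObject("o2", "screwdriver", {"blue"}))
g.add_object(SceneObject("o3", "cup"))
g.add_object(SceneObject("o4", "laptop"))
g.add_relation("o3", "next_to", "o4")

print(g.resolve_direct_attribute("screwdriver", "red"))   # -> the red screwdriver (o1)
print(g.resolve_relational("cup", "next_to", "laptop"))   # -> the cup near the laptop (o3)
```

In the full pipeline described above, the resolved node would then be rendered as persistent in-situ AR guidance on the live shared view.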