Many everyday tasks rely on external tutorials such as manuals and videos, forcing users to switch constantly between reading instructions and performing actions, which disrupts workflow and increases cognitive load. Augmented reality (AR) enables in-situ guidance, and recent advances in large language models (LLMs) and vision-language models (VLMs) make it possible to generate such guidance automatically. However, existing AI-powered AR tutorial systems focus primarily on physical procedural tasks and provide limited support for hybrid physical and virtual workspaces. To address this gap, we conducted a formative study of guidance needs in cross-reality tasks, which we categorize into four types: real-to-real (R2R), real-to-virtual (R2V), virtual-to-real (V2R), and virtual-to-virtual (V2V). The study identified key requirements for state awareness and cross-reality coordination. Informed by these requirements, we present JARVIS, a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and adaptive visual feedback. A within-subjects study (N=14) across four domains shows that JARVIS improves usability, success rate, and visualization effectiveness and reduces workload relative to baselines.