Many everyday tasks rely on external tutorials such as manuals and videos, which force users to switch constantly between reading instructions and performing actions; this disrupts workflow and increases cognitive load. Augmented reality (AR) enables in-situ guidance, and recent advances in large language models (LLMs) and vision-language models (VLMs) make it possible to generate such guidance automatically. However, existing AI-powered AR tutorial systems primarily focus on physical procedural tasks and provide limited support for hybrid physical and virtual workspaces. To address this gap, we conducted a formative study of guidance needs in cross-reality tasks, which we categorize into four types: real-to-real (R2R), real-to-virtual (R2V), virtual-to-real (V2R), and virtual-to-virtual (V2V). The study identified key requirements for state awareness and cross-reality coordination. Informed by these requirements, we present JARVIS, a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and adaptive visual feedback. A within-subjects study (N=14) across four domains shows that JARVIS outperforms baselines on usability, workload, success rate, and visualization effectiveness.