Many everyday tasks rely on external tutorials such as manuals and videos, forcing users to switch constantly between reading instructions and performing actions, which disrupts workflow and increases cognitive load. Augmented reality (AR) enables in-situ guidance, and recent advances in large language models (LLMs) and vision-language models (VLMs) make it possible to generate such guidance automatically. However, existing AI-powered AR tutorial systems focus primarily on physical procedural tasks and provide limited support for hybrid physical and virtual workspaces. To address this gap, we conducted a formative study to understand guidance needs in cross-reality tasks, which we categorize into four types: real-to-real (R2R), real-to-virtual (R2V), virtual-to-real (V2R), and virtual-to-virtual (V2V). The study identified key requirements for state awareness and cross-reality coordination. Informed by these findings, we present JARVIS, a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and adaptive visual feedback. A within-subjects study (N=14) across four domains shows that JARVIS outperforms baselines on usability, workload, success rate, and visualization effectiveness.