Augmented reality (AR) requires the seamless integration of visual, auditory, and linguistic channels for effective human-computer interaction. While auditory and visual inputs enable real-time, contextual user guidance, the potential of large language models (LLMs) in this setting remains largely untapped. Our study introduces a method that uses LLMs to combine information from visual, auditory, and contextual modalities. Focusing on the challenge of quantifying task performance in AR, we use egocentric video, speech, and context analysis. Integrating LLMs improves state estimation, marking a step towards more adaptive AR systems. Code, dataset, and demo will be available at https://github.com/nguyennm1024/misar.