The progression toward "Pervasive Augmented Reality" envisions continuous, easy access to multimodal information. However, in many everyday scenarios users are physically, cognitively, or socially occupied, which can increase the friction of acting upon the multimodal information they encounter in the world. To reduce such friction, future interactive interfaces should intelligently provide quick access to digital actions based on users' context. To explore the range of possible digital actions, we conducted a diary study in which participants captured and shared the media they intended to act on (e.g., images or audio), along with their desired actions and other contextual information. Using this data, we generated a holistic design space of digital follow-up actions that could be performed in response to different types of multimodal sensory inputs. We then designed OmniActions, a pipeline powered by large language models (LLMs) that processes multimodal sensory inputs and predicts follow-up actions on the target information, grounded in the derived design space. Using the empirical data collected in the diary study, we quantitatively evaluated three variations of LLM techniques (intent classification, in-context learning, and fine-tuning) and identified the most effective technique for our task. Additionally, as an instantiation of the pipeline, we developed an interactive prototype and reported preliminary user feedback on how people perceive and react to the action predictions and their errors.
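To make the in-context learning variant concrete, the following is a minimal sketch, not the authors' implementation: it builds a few-shot prompt from hypothetical diary-style records and predicts a follow-up action for a new multimodal context. The action labels, example records, and the `call_llm` placeholder are all illustrative assumptions.

```python
# Hypothetical sketch of in-context learning for follow-up action prediction.
# Action labels and examples are illustrative, not the OmniActions design space.

# Candidate follow-up actions (assumed subset of a larger design space).
ACTIONS = ["Save for later", "Share with someone", "Search for more info",
           "Set a reminder", "Translate", "Navigate to place"]

# Hypothetical in-context examples: (captured media + context) -> desired action.
EXAMPLES = [
    ("Photo of a concert poster seen while walking to work", "Set a reminder"),
    ("Audio clip of a song playing in a cafe", "Search for more info"),
    ("Photo of a restaurant menu in a foreign language", "Translate"),
]

def build_prompt(new_input: str) -> str:
    """Assemble a few-shot classification prompt over the candidate actions."""
    lines = ["Predict the user's desired follow-up action.",
             "Possible actions: " + ", ".join(ACTIONS), ""]
    for context, action in EXAMPLES:
        lines.append(f"Input: {context}\nAction: {action}\n")
    lines.append(f"Input: {new_input}\nAction:")
    return "\n".join(lines)

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (e.g., a chat completion request)."""
    return "Share with someone"  # canned response so the sketch runs offline

if __name__ == "__main__":
    prompt = build_prompt("Photo of a flyer for a weekend farmers market, "
                          "captured while chatting with a friend")
    predicted = call_llm(prompt).strip()
    print(f"Predicted follow-up action: {predicted}")
```

In a real pipeline, `call_llm` would send the prompt to an LLM and the returned label would be validated against the action set before being surfaced to the user.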